A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems




A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems

Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne, Australia 3122
{dyuan, yyang, xliu, jchen}@swin.edu.au

Abstract — Many scientific workflows are data intensive: a large volume of intermediate data is generated during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, determined manually. As doing science on the cloud has become popular nowadays, more intermediate data can be stored in scientific cloud workflows based on a pay-for-use model. In this paper, we build an Intermediate data Dependency Graph (IDG) from the data provenance in scientific workflows. Based on the IDG, we develop a novel intermediate data storage strategy that can reduce the cost of the scientific cloud workflow system by automatically storing the most appropriate intermediate datasets in the cloud storage. We utilise Amazon's cost model and apply the strategy to an astrophysics pulsar searching scientific workflow for evaluation. The results show that our strategy can reduce the overall cost of scientific cloud workflow execution significantly.

Keywords - data storage; cost; scientific workflow; cloud computing

I. INTRODUCTION

Scientific applications are usually complex and data-intensive. In many fields, like astronomy [9], high-energy physics [17] and bio-informatics [19], scientists need to analyse terabytes of data either from existing data resources or collected from physical devices. The scientific analyses are usually computation intensive, hence taking a long time to execute. Workflow technologies can be employed to automate these scientific applications. Accordingly, scientific workflows are typically very complex. They usually have a large number of tasks and need a long time for execution. During the execution, a large volume of new intermediate data will be generated [10].
They could be even larger than the original data and contain some important intermediate results. After the execution of a scientific workflow, some intermediate data may need to be stored for future use because: 1) scientists may need to re-analyse the results or apply new analyses to the intermediate data; 2) for collaboration, the intermediate results are shared among scientists from different institutions and the intermediate data can be reused. Storing valuable intermediate data can save the cost of regenerating them when they are reused, not to mention the waiting time saved. Given the large size of the data, running scientific workflow applications usually needs not only high performance computing resources but also massive storage [10]. Nowadays, popular scientific workflows are often deployed in grid systems [17] because they have high performance and massive storage. However, building a grid system is extremely expensive and it is normally not open to scientists all over the world. The emergence of cloud computing technologies offers a new way to develop scientific workflow systems, in which one research topic is cost-effective strategies for storing intermediate data. In late 2007 the concept of cloud computing was proposed [23] and it is deemed the next generation of IT platforms that can deliver computing as a kind of utility [8]. Foster et al. made a comprehensive comparison of grid computing and cloud computing [12]. Cloud computing systems provide the high performance and massive storage required for scientific applications in the same way as grid systems, but with a lower infrastructure construction cost among many other features, because cloud computing systems are composed of data centres which can be clusters of commodity hardware [23]. Research into doing science and data-intensive applications on the cloud has already commenced [18], with early experiences such as the Nimbus [15] and Cumulus [22] projects. The work by Deelman et al.
[11] shows that cloud computing offers a cost-effective solution for data-intensive applications, such as scientific workflows [14]. Furthermore, cloud computing systems offer a new model in which scientists from all over the world can collaborate and conduct their research together. Cloud computing systems are based on the Internet, and so are the scientific workflow systems deployed in the cloud. Scientists can upload their data and launch their applications on the scientific cloud workflow systems from anywhere in the world via the Internet, and they only need to pay for the resources that they use for their applications. As all the data are managed in the cloud, it is easy to share data among scientists. Scientific cloud workflows are deployed in a cloud computing environment, where all the resources need to be paid for use. For a scientific cloud workflow system, storing all the intermediate data generated during workflow executions may cause a high storage cost. On the contrary, if we delete all the intermediate data and regenerate them every time they are needed, the computation

cost of the system may well be very high too. The aim of intermediate data management is to reduce the total cost of the whole system. The best way is to find a balance: selectively store some popular datasets and regenerate the rest when needed. In this paper, we propose a novel strategy for the intermediate data storage of scientific cloud workflows to reduce the overall cost of the system. The intermediate data in scientific cloud workflows often have dependencies. Along workflow execution, they are generated by the tasks. A task can operate on one or more datasets and generate new one(s). These generation relationships are a kind of data provenance. Based on the data provenance, we create an Intermediate data Dependency Graph (IDG), which records the information of all the intermediate datasets that have ever existed in the cloud workflow system, no matter whether they are stored or deleted. With the IDG, the system knows how the intermediate datasets are generated and can further calculate their generation cost. Given an intermediate dataset, we divide its generation cost by its usage rate, so that this cost (the generation cost per unit time) can be compared with its storage cost per time unit, where a dataset's usage rate is the time between every usage of this dataset, which can be obtained from the system log. Our strategy can automatically decide whether an intermediate dataset should be stored or deleted in the cloud system by comparing its generation cost and storage cost, no matter whether this intermediate dataset is a new, regenerated or stored dataset in the system. The remainder of this paper is organised as follows. Section 2 gives a motivating example and analyses the research problems. Section 3 introduces some important concepts related to our strategy. Section 4 presents the detailed algorithms of our strategy. Section 5 demonstrates the simulation results and the evaluation. Section 6 discusses the related work. Section 7 addresses our conclusions and future work. II.
MOTIVATING EXAMPLE AND PROBLEM ANALYSIS

2.1 Motivating example

Scientific applications often need to process a large amount of data. For example, the Swinburne Astrophysics group has been conducting a pulsar searching survey using the observation data from Parkes Radio Telescope (http://astronomy.swin.edu.au/pulsar/), which is one of the most famous radio telescopes in the world (http://www.parkes.atnf.csiro.au). Pulsar searching is a typical scientific application. It contains complex and time consuming tasks and needs to process terabytes of data. Fig. 1 depicts the high level structure of a pulsar searching workflow.

Figure 1. Pulsar searching workflow

At the beginning, raw signal data from Parkes Radio Telescope are recorded at a rate of one gigabyte per second by the ATNF Parkes Swinburne Recorder (APSR), where ATNF refers to the Australian Telescope National Facility. Depending on the area of the universe in which the researchers want to conduct the pulsar searching survey, the observation time is normally from 4 minutes to one hour. Recorded from the telescope in real time, these raw data files have data from multiple beams interleaved. For initial preparation, different beam files are extracted from the raw data files and compressed. They are 1GB to 20GB each in size, depending on the observation time. The beam files contain the pulsar signals, which are dispersed by the interstellar medium. De-dispersion is used to counteract this effect. Since the potential dispersion source is unknown, a large number of de-dispersion files will be generated with different dispersion trials. In the current pulsar searching survey, 1200 is the minimum number of dispersion trials. Based on the size of the input beam file, this de-dispersion step will take 1 to 13 hours to finish and generate up to 90GB of de-dispersion files. Furthermore, for binary pulsar searching, every de-dispersion file will need another processing step named accelerate. This step will generate the accelerated de-dispersion files, of similar size to those of the de-dispersion step.
Based on the generated de-dispersion files, different seeking algorithms can be applied to search for pulsar candidates, such as FFT Seeking, FFA Seeking, and Single Pulse Seeking. For a large input beam file, it will take more than one hour to seek through the 1200 de-dispersion files. A candidate list of pulsars is generated after the seeking step and saved in a text file. Furthermore, by comparing the candidates generated from different beam files in the same time session, some interference may be detected and some candidates eliminated. With the final pulsar candidates, we need to go back to the beam files or the de-dispersion files to find their feature signals and fold them into XML files. At last, the XML files are visually displayed to researchers for making decisions on whether a pulsar has been found or not.

As described above, we can see that this pulsar searching workflow is both computation and data intensive. It is currently running on the Swinburne high performance supercomputing facility (http://astronomy.swinburne.edu.au/supercomputing/). It needs a long execution time and a large amount of intermediate data is generated. At present, all the intermediate data are deleted after having been used, and the scientists only store the raw beam data, which are extracted from the raw telescope data. Whenever there is a need to use the intermediate data, the scientists regenerate them based on the raw beam files. The intermediate data are not stored, mainly because the supercomputer is a shared facility that cannot offer unlimited storage capacity to hold the accumulated terabytes of data. However, some intermediate data are better stored. For example, the de-dispersion files are frequently used intermediate data. Based on them, the scientists can apply different seeking algorithms to find potential pulsar candidates. Furthermore, some intermediate data are derived from the de-dispersion files, such as the results of the seeking algorithms and the pulsar candidate list. If these data are reused, the de-dispersion files will also need to be regenerated. For the large input beam files, the regeneration of the de-dispersion files will take more than 10 hours. This not only delays the scientists from conducting their experiments, but also wastes a lot of computation resources. On the other hand, some intermediate data may not need to be stored, for example the accelerated de-dispersion files, which are generated by the accelerate step. The accelerate step is an optional step that is only for binary pulsar searching. Not all pulsar searching processes need to accelerate the de-dispersion files, so the accelerated de-dispersion files are not used that often.
In light of this, and given the large size of these data, they are not worth storing, as it would be more cost effective to regenerate them from the de-dispersion files whenever they are needed.

2.2 Problem analysis

Traditionally, scientific workflows are deployed on high performance computing facilities, such as clusters and grids. Scientific workflows are often complex, with huge intermediate data generated during their execution. How to store these intermediate data is normally decided by the scientists who use the scientific workflows. This is because the clusters and grids only serve certain institutions. The scientists may store the intermediate data that are most valuable to them, based on the storage capacity of the system. However, in many scientific workflow systems, the storage capacities are limited, as in the pulsar searching workflow we introduced. The scientists have to delete all the intermediate data because of the storage limitation. This storage bottleneck can be avoided if we run scientific workflows in the cloud. In a cloud computing environment, theoretically, the system can offer unlimited storage resources. All the intermediate data generated by scientific cloud workflows can be stored, if we are willing to pay for the required resources. However, in scientific cloud workflow systems, whether or not to store intermediate data is no longer an easy decision. 1) All the resources in the cloud carry certain costs, so whether we store or generate an intermediate dataset, we have to pay for the resources used. The intermediate datasets vary in size, and have different generation costs and usage rates. Some of them may be used often whilst others may not. On one hand, it is most likely not cost effective to store all the intermediate data in the cloud. On the other hand, if we delete them all, regeneration of frequently used intermediate datasets imposes a high computation cost. We need a strategy to balance the generation cost and the storage cost of the intermediate data, in order to reduce the total cost of the scientific cloud workflow system.
In this paper, given the large capacity of a data centre and the consideration of cost effectiveness, we assume that all the intermediate data are stored in one data centre; therefore, data transfer cost is not considered. 2) The scientists can no longer predict the usage rate of the intermediate data. For a single research group, if the data resources of the applications are only used by its own scientists, the scientists may predict the usage rate of the intermediate data and decide whether to store or delete them. However, the scientific cloud workflow system is not developed for a single scientist or institution, but rather for scientists from different institutions to collaborate and share data resources. The users of the system could be anonymous users from the Internet. We must have a strategy for storing the intermediate data, based on the needs of all the users, that can reduce the cost of the whole system. Hence, for scientific cloud workflow systems, we need a strategy that can automatically select and store the most appropriate intermediate datasets. Furthermore, this strategy should be cost effective in that it can reduce the total cost of the whole system.

III. COST ORIENTED INTERMEDIATE DATA STORAGE IN SCIENTIFIC CLOUD WORKFLOWS

3.1 Data management in scientific cloud workflow systems

In a cloud computing system, application data are stored in large data centres. The cloud users visit the system via the Internet and upload the data to conduct their applications. All the application data are stored in the cloud storage and managed by the cloud system independently of users. As time goes on and the number of cloud users increases, the volume of data stored in the cloud will become huge. This makes data management in a cloud computing system a very challenging job.

A scientific cloud workflow system is a workflow system for scientists to run their applications in the cloud. As depicted in Figure 2, it differs from traditional scientific workflow systems in data management.

Figure 2. Structure of data management in a scientific cloud workflow system

The most important differences are as follows. 1) For scientific cloud workflows, all the application data are managed in the cloud. To launch their workflows, the scientists have to upload their application data to the cloud storage via a Web portal. This requires data management to be automatic. 2) The scientific cloud workflow system has a cost model. The scientists have to pay for the resources used for conducting their applications. Hence, the data management has to be cost oriented. 3) The scientific cloud workflow system is based on the Internet, where the application data are shared and reused among scientists worldwide. For data re-analyses and regenerations, data provenance is more important in scientific cloud workflows. In general, there are two types of data stored in the cloud storage: input data and intermediate data (including result data). First, input data are the data uploaded by users; in scientific applications they can also be the raw data collected from devices. These data are the original data for processing or analysis, and are usually the input of the applications. The most important feature of these data is that if they were deleted, they could not be regenerated by the system. Second, intermediate data are the data newly generated in the cloud system while the application runs. These data save the intermediate computation results of the application that will be used in future executions. In general, the final result data of an application are also a kind of intermediate data, because the result data of one application can be used in other applications. When further operations are applied to the result data, they become intermediate data.
Hence, the intermediate data are the data generated from either the input data or other intermediate data, and their most important feature is that they can be regenerated if we know their provenance. For the input data, the users decide whether they should be stored or deleted, since they cannot be regenerated once deleted. For the intermediate data, their storage status can be decided by the system, since they can be regenerated. Hence, in this paper we develop a strategy for intermediate data storage that can significantly reduce the cost of a scientific cloud workflow system.

3.2 Data provenance and Intermediate data Dependency Graph (IDG)

Scientific workflows have many computation and data intensive tasks that generate many intermediate datasets of considerable size, and dependencies exist among these datasets. Data provenance in workflows is a kind of important metadata in which the dependencies between datasets are recorded [21]. The dependency depicts the derivation relationship between workflow intermediate datasets. For scientific workflows, data provenance is especially important because, after the execution, some intermediate datasets may be deleted, but sometimes the scientists have to regenerate them for either reuse or reanalysis [7]. Data provenance records the information of how the intermediate datasets were generated, which is very important for the scientists. Furthermore, regeneration of the intermediate datasets from the input data may be very time consuming, and therefore carry a high cost. With data provenance information, the regeneration of a demanded dataset may instead start from some stored intermediate datasets. In the scientific cloud workflow system, data provenance is recorded during workflow execution. Taking advantage of data provenance, we can build an IDG. For all the intermediate datasets ever generated in the system, whether stored or deleted, their references are recorded in the IDG.

Figure 3. A simple Intermediate data Dependency Graph (IDG)

The IDG is a directed acyclic graph, where every node in the graph denotes an intermediate dataset. Figure 3 shows a simple IDG: dataset d1 pointing to d2 means that d1 is used to generate d2; datasets d2 and d3 pointing to d4 means that d2 and d3 are used together to generate d4; and d5 pointing to d6 and d7 means that d5 is used to generate either d6 or d7 based on different operations. In the IDG, the provenances of all the intermediate datasets are recorded. When some deleted intermediate datasets need to be reused, we do not need to regenerate them from the original input data. With the IDG, the system can find the predecessor datasets of the demanded data, so they can be regenerated from their nearest existing predecessor datasets.

3.3 Cost model

With the IDG, given any intermediate dataset that ever occurred in the system, we know how to regenerate it. However, in this paper, we aim at reducing the total cost of managing the intermediate data. In a cloud computing environment, if users want to deploy and run applications, they need to pay for the resources used. The resources are offered by cloud service providers, who have their own cost models to charge the users. In general, there are two basic types of resources in cloud computing: storage and computation. Popular cloud service providers' cost models are based on these two types of resources [1]. Furthermore, the cost of data transfer is also considered, such as in Amazon's cost model. In [11], the authors state that a cost-effective way of doing science in the cloud is to upload all the application data to the cloud and run all the applications on the cloud services. So we assume that the scientists upload all the input data to the cloud to conduct their experiments. Because transferring data within one cloud service provider's facilities is usually free, the data transfer cost of managing intermediate data during workflow execution is not counted.
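The nearest-stored-predecessor lookup described above can be sketched as a small traversal over the IDG. The following is an illustrative sketch, not the authors' implementation: the edges mirror the simple IDG of Figure 3, while which datasets are currently stored is an assumption made for the example.

```python
# A minimal sketch of the IDG in Figure 3: a DAG whose nodes are
# intermediate datasets and whose edges are generation relationships.
# The "stored" set is an illustrative assumption.

# predecessors[d] lists the datasets used directly to generate d
predecessors = {
    "d1": [],
    "d2": ["d1"],
    "d3": [],
    "d4": ["d2", "d3"],
    "d5": ["d4"],
    "d6": ["d5"],
    "d7": ["d5"],
}
stored = {"d1", "d4"}  # datasets currently kept in cloud storage

def regeneration_plan(dataset):
    """Return the deleted datasets that must be regenerated (in order)
    to rebuild `dataset`, starting from its nearest stored predecessors
    instead of the original input data."""
    plan = []
    def visit(d):
        if d in stored:
            return  # an existing predecessor: regeneration starts here
        for p in predecessors[d]:
            visit(p)
        if d not in plan:
            plan.append(d)
    visit(dataset)
    return plan

print(regeneration_plan("d6"))  # → ['d5', 'd6']: rebuild d5 from stored d4
```

Because the traversal stops at the first stored ancestor on each path, the regeneration cost of a dataset only depends on its deleted predecessors, which is exactly the role the pSet attribute plays in the cost model below.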
In this paper, we define our cost model for managing the intermediate data in a scientific cloud workflow system as follows: Cost = C + S, where the total cost of the system, Cost, is the sum of C, the total cost of the computation resources used to regenerate intermediate data, and S, the total cost of the storage resources used to store intermediate data. Different cloud service providers have different prices for the resources. In this paper, we use Amazon services' prices as follows: $0.15 per Gigabyte per month for the storage resources, and $0.1 per CPU hour for the computation resources. Furthermore, we denote these two prices as CostS and CostC respectively in the algorithms.

To utilise the cost model, we define some important attributes for the intermediate datasets in the IDG. For an intermediate dataset d, its attributes are denoted as <size, flag, tp, t, pSet, fSet, CostR>, where:

size denotes the size of this dataset;

flag denotes whether this dataset is stored or deleted in the system;

tp denotes the time of generating this dataset from its direct predecessor datasets;

t denotes the usage rate, which is the time between every usage of d in the system. In traditional scientific workflows, t can be defined by the scientists who use the workflow collaboratively. However, a scientific cloud workflow system is based on the Internet with a large number of users and, as discussed before, t cannot be defined by the users. It is a forecast value derived from the dataset's usage history recorded in the system logs, and a dynamic value that changes according to d's real usage rate in the system;

pSet is the set of references of all the deleted intermediate datasets in the IDG that link to d, as shown in Figure 4. If we want to regenerate d, d.pSet contains all the datasets that need to be regenerated beforehand. Hence, the generation cost of d can be denoted as: gencost(d) = (d.tp + Σ_{dj∈d.pSet} dj.tp) × CostC;

fSet is the set of references of all the deleted intermediate datasets in the IDG that are linked from d, as shown in Figure 4. If d is deleted, to regenerate any dataset in d.fSet we have to regenerate d first. In other words, if the storage status of d changes, the generation cost of all the datasets in d.fSet is affected by gencost(d);

Figure 4. A segment of IDG

CostR is d's cost rate, which means the average cost per time unit of the dataset d in the system; in this paper we use hour as the time unit. If d is a stored dataset, d.CostR = d.size × CostS. If d is a deleted dataset, then whenever we need to use d we have to regenerate it, so we divide the generation cost of d by the time between its usages and use this value as the cost rate of d: d.CostR = gencost(d) / d.t. When the storage status of d changes, its CostR changes correspondingly.

Hence, the system cost rate of managing intermediate data is the sum of the cost rates of all the intermediate datasets, which is Σ_{d∈IDG} d.CostR. Given a time duration [T0, Tn], the total system cost is the integral of the system cost rate over this duration as a function of time t, which is:
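These definitions can be sketched directly in code. The following is an illustrative sketch under the cost model above: the formulas for gencost, CostR and the system cost rate follow the paper's definitions, while the Dataset class layout and the conversion of the monthly storage price to an hourly rate are assumptions made for the example.

```python
# Sketch of the cost model: gencost, CostR and the system cost rate,
# using the Amazon prices quoted above ($0.15/GB-month, $0.1/CPU-hour).
# The Dataset class layout is an illustrative assumption.

COST_S = 0.15 / (30 * 24)   # storage price converted to $ per GB-hour
COST_C = 0.1                # computation price, $ per CPU-hour

class Dataset:
    def __init__(self, size_gb, t_p, t, stored, p_set=()):
        self.size = size_gb       # size of the dataset in GB
        self.t_p = t_p            # hours to generate from direct predecessors
        self.t = t                # usage rate: hours between usages
        self.stored = stored      # flag: stored or deleted
        self.p_set = list(p_set)  # deleted predecessors to rebuild first

def gencost(d):
    """gencost(d) = (d.t_p + sum of t_p over d.pSet) * CostC."""
    return (d.t_p + sum(dj.t_p for dj in d.p_set)) * COST_C

def cost_rate(d):
    """CostR: the storage rate if d is stored, otherwise its
    generation cost spread over the interval between usages."""
    return d.size * COST_S if d.stored else gencost(d) / d.t

def system_cost_rate(idg):
    """Sum of CostR over every dataset in the IDG; the total cost over
    [T0, Tn] is the integral of this rate over time."""
    return sum(cost_rate(d) for d in idg)
```

For example, a deleted dataset taking 2 CPU-hours to generate and used every 100 hours has a cost rate of 0.2/100 = $0.002 per hour; at these prices, storing it instead only pays off if it is smaller than about 9.6 GB.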

Total_Cost = ∫_{T0}^{Tn} ( Σ_{d∈IDG} d.CostR ) dt

The goal of our intermediate data management is to reduce this cost. In the next section, we introduce a dependency based intermediate data storage strategy, which selectively stores the intermediate datasets to reduce the total cost of the scientific cloud workflow system.

IV. DEPENDENCY BASED INTERMEDIATE DATA STORAGE STRATEGY

The IDG records the references of all the intermediate datasets that ever occurred in the system and their dependencies; some datasets may be stored in the system and others may be deleted. When new datasets are generated in the system, their information is added to the IDG immediately. Our dependency based intermediate data storage strategy is developed based on the IDG and applied at workflow runtime. It can dynamically store the essential intermediate datasets during workflow execution. The strategy contains three algorithms, described in this section.

4.1 Algorithm for deciding a newly generated intermediate dataset's storage status

Suppose d0 is a newly generated intermediate dataset. First, we add its information to the IDG. We find the provenance datasets of d0 in the IDG, and add edges pointing to d0 from these datasets. Then we initialise its attributes. As d0 does not have a usage history yet, we use the average value in the system as the initial value of d0's usage rate. Next, we check whether d0 needs to be stored or not. As d0 is newly added to the IDG, it does not have successor datasets, which means no intermediate datasets are derived from d0 at this moment. For deciding whether to store or delete d0, we only compare the generation cost and storage cost of d0 itself, which are gencost(d0)/d0.t and d0.size × CostS. If the cost of generation is larger than the cost of storage, we store d0 and set d0.CostR = d0.size × CostS; otherwise we delete d0 and set d0.CostR = gencost(d0)/d0.t. The algorithm is shown in Figure 5:

gencost(d0) = (d0.tp + Σ_{d∈d0.pSet} d.tp) × CostC;
if gencost(d0)/d0.t > d0.size × CostS
    then d0.CostR = d0.size × CostS;
    else d0.CostR = gencost(d0)/d0.t;

Figure 5. Algorithm for handling newly generated datasets

In this algorithm, we guarantee that all the intermediate datasets chosen to be stored are necessary, in the sense that deleting any one of them would increase the cost of the system, since they all have a higher generation cost than storage cost.

4.2 Algorithm for managing stored intermediate datasets

The usage rate t of a dataset is an important parameter that determines its storage status. Since t is a dynamic value that may change at any time, we have to dynamically check whether the stored intermediate datasets in the system still need to be stored. For an intermediate dataset d0 that is stored in the system, we set a threshold time tθ, where d0.tθ = gencost(d0) / (d0.size × CostS). This threshold time indicates how long this dataset can be stored in the system for the cost of generating it. If d0 has not been used for the time tθ, we check whether it should remain stored. If we delete the stored intermediate dataset d0, the system cost rate is reduced by d0's storage cost rate, which is d0.size × CostS. Meanwhile, the increase of the system cost rate is the sum of the generation cost rate of d0 itself, which is gencost(d0)/d0.t, and the increased generation cost rates of all the datasets in d0.fSet caused by deleting d0, which is Σ_{d∈d0.fSet} (gencost(d0)/d.t). We compare d0's storage cost rate and generation cost rate to decide whether d0 should remain stored or not. The detailed algorithm is shown in Figure 6.

Lemma: The deletion of a stored intermediate dataset d0 in the IDG does not affect the stored datasets adjacent to d0, where the stored datasets adjacent to d0 are the datasets that directly link to d0 or d0.pSet, and the datasets that are directly linked from d0 or d0.fSet.

Proof: 1) Suppose dp is a stored dataset directly linked to d0 or d0.pSet. Since d0 is deleted, d0 and d0.fSet are added to dp.fSet. So the new generation cost rate of dp in the system is gencost(dp)/dp.t + Σ_{d∈(dp.fSet ∪ {d0} ∪ d0.fSet)} (gencost(dp)/d.t), and it is larger than before, which was

( gencost ( d ) d t) ( d p ) d p. t + d d p fset p. Hence d p gencost.. stll needs to be stored; 2) Suppose d f s a stored dataset drectly lnked by d 0 or d 0.fSet. Snce d 0 s deleted, d 0 and d 0.pSet are added to d f.pset. So the new generaton cost of d f s gencost( d f ) = ( d f. t p + d d pset d d pset d t p ) CostC f. U 0U 0.., and t s larger than before, whch was ( d. t + d t ) CostC gencost( d f ) = f p d d f. pset. p. Because of the ncrease of gencost(d f ), the generaton cost rate of d f n the system s larger than before, whch was gencost ( d f ) d f. t + d d fset ( gencost d f d t) f. ( ).. Hence d f stll needs to be stored. Because of 1) and 2), the Lemma holds. Input: a stored ntermedate dataset d 0 ; an IDG ; Output: storage strategy of d 0 ; ( ) ; gencost( d0 ) = d.. //calculate d 0 s generaton cost 0. d t d0 t CostC d pset p + p f ( gencost( d0 ) d0. t + d ( gencost d d t) d sze CostS ) //compare d 0 s storage and generaton cost rate d fset ( ). >. 0. 0 0 T ' = T + d 0. t θ ; //set the next checkng tme T, T s the current system tme, t s the duraton d 0 should be stored θ else d 0.flag= deleted ; //decde to delete d 0 d 0. CostR = gencost( d0 ) d0. t ; //change d 0 s cost rate for (every d n d 0.fSet ) //change the cost rates of all the datasets n d 0.fSet d. CostR = d. CostR + gencost( d0 ) d. t ; //cost rate ncreases wth the generaton cost of d 0 update IDG & execute store or delete of d 0 ; Fgure 6. Algorthm for checkng stored ntermedate datasets By applyng the algorthm of checkng the stored ntermedate datasets, we can stll guarantee that all the datasets we have kept n the system are necessary to be stored. Furthermore, when the deleted ntermedate datasets are regenerated, we also need to check whether to store or delete them as dscussed next. 4.3 Algorthm for decdng the regenerated ntermedate datasets storage status The IDG s a dynamc graph where the nformaton of new ntermedate datasets may jon at anytme. 
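To make the cost model concrete, the sketch below implements the attribute definitions and the decisions of Figures 5 and 6 in Python. The Dataset class, the attribute names, and the prices are illustrative assumptions (the hour as the time unit, Amazon-style prices), not code from the paper.

```python
from dataclasses import dataclass, field
from typing import List

COST_C = 0.10              # assumed computation price: $ per CPU hour
COST_S = 0.15 / (30 * 24)  # assumed storage price: $ per GB per hour

@dataclass
class Dataset:
    size: float                 # GB
    t_p: float                  # CPU hours to regenerate from direct predecessors
    t: float                    # average time between usages (hours)
    stored: bool = False
    p_set: List["Dataset"] = field(default_factory=list)  # deleted provenance datasets
    f_set: List["Dataset"] = field(default_factory=list)  # deleted datasets derived from it
    cost_rate: float = 0.0      # d.CostR, $ per hour

def gencost(d: Dataset) -> float:
    """gencost(d) = (d.t_p + sum of t_p over d.pSet) * CostC."""
    return (d.t_p + sum(p.t_p for p in d.p_set)) * COST_C

def decide_new(d0: Dataset) -> None:
    """Figure 5: a new dataset has no successors yet, so compare only its
    own generation cost rate with its storage cost rate."""
    if gencost(d0) / d0.t > d0.size * COST_S:
        d0.stored, d0.cost_rate = True, d0.size * COST_S
    else:
        d0.stored, d0.cost_rate = False, gencost(d0) / d0.t

def check_stored(d0: Dataset) -> None:
    """Figure 6: keep a stored dataset only while its generation cost rate,
    including the rates it adds to every deleted dataset in d0.fSet,
    exceeds its storage cost rate."""
    gen_rate = gencost(d0) / d0.t + sum(gencost(d0) / d.t for d in d0.f_set)
    if gen_rate <= d0.size * COST_S:
        d0.stored, d0.cost_rate = False, gencost(d0) / d0.t
        for d in d0.f_set:  # deleting d0 raises its successors' cost rates
            d.cost_rate += gencost(d0) / d.t
```

For example, a 1 GB dataset that takes 100 CPU hours to regenerate and is used daily is stored, since its generation cost rate of roughly $0.42/hour far exceeds its storage cost rate of roughly $0.0002/hour.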
Although the algorithms in the above two sub-sections can guarantee that the stored intermediate datasets are all necessary, these stored datasets may not be the most cost effective. Intermediate datasets that were initially deleted may need to be stored as the IDG expands. Suppose d_0 is a regenerated intermediate dataset in the system, which was deleted before. After it has been used, we have to recalculate d_0's storage status, as well as that of the stored datasets adjacent to d_0 in the IDG.

Theorem: If the regenerated intermediate dataset d_0 is stored, only the stored datasets adjacent to d_0 in the IDG may need to be deleted to reduce the system cost.

Proof: 1) Suppose d_p is a stored dataset directly linked to d_0 or d_0.pSet. Since d_0 is stored, d_0 and d_0.fSet need to be removed from d_p.fSet. So the new generation cost rate of d_p in the system is gencost(d_p) / d_p.t + Σ_{d_i ∈ d_p.fSet − {d_0} − d_0.fSet} (gencost(d_p) / d_i.t), which is smaller than before, which was gencost(d_p) / d_p.t + Σ_{d_i ∈ d_p.fSet} (gencost(d_p) / d_i.t). If the new generation cost rate is smaller than the storage cost rate of d_p, d_p will be deleted. The rest of the stored intermediate datasets are not affected by the deletion of d_p, because of the Lemma introduced before. 2) Suppose d_f is a stored dataset directly linked by d_0 or d_0.fSet. Since d_0 is stored, d_0 and d_0.pSet need to be removed from d_f.pSet. So the new generation cost of d_f is gencost(d_f) = (d_f.t_p + Σ_{d_i ∈ d_f.pSet − {d_0} − d_0.pSet} d_i.t_p) × CostC, which is smaller than before, which was gencost(d_f) = (d_f.t_p + Σ_{d_i ∈ d_f.pSet} d_i.t_p) × CostC. Because of the reduction of gencost(d_f), the generation cost rate of d_f in the system is smaller than before, which was gencost(d_f) / d_f.t + Σ_{d_i ∈ d_f.fSet} (gencost(d_f) / d_i.t). If the new generation cost rate is smaller than the storage cost rate of d_f, d_f will be deleted. The rest of the stored intermediate datasets are not affected by the deletion of d_f, because of the Lemma introduced before. Because of 1) and 2), the Theorem holds.
If we store the regenerated intermediate dataset d_0, the system cost rate increases by d_0's storage cost rate, which is d_0.size × CostS. Meanwhile, the reduction of the system cost rate may come from three aspects: (1) the generation cost rate of d_0 itself, which is gencost(d_0) / d_0.t; (2) the reduced generation cost rates of all the datasets in d_0.fSet caused by storing d_0, which is Σ_{d_i ∈ d_0.fSet} (gencost(d_0) / d_i.t); (3) as indicated in the Theorem, some stored datasets adjacent to d_0 may be deleted, which further reduces the cost to the system.
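A minimal sketch of this comparison, covering aspects (1) and (2) only; the function name and the units (dollars, GB, hours) are assumptions, and aspect (3), deleting adjacent stored datasets, is omitted for brevity (it would only make storing more attractive):

```python
def should_store_regenerated(size_gb, gen_cost, t_use_hours, f_set_t_uses,
                             cost_s=0.15 / (30 * 24)):
    """Store a regenerated dataset d0 iff the cost-rate reduction from
    aspects (1) and (2) outweighs the added storage cost rate.
    f_set_t_uses holds the usage intervals (hours) of the deleted
    datasets in d0.fSet."""
    increase = size_gb * cost_s                           # d0.size * CostS
    reduction = gen_cost / t_use_hours                    # (1) gencost(d0)/d0.t
    reduction += sum(gen_cost / t for t in f_set_t_uses)  # (2) over d0.fSet
    return reduction > increase
```

For example, a small dataset with a $5 regeneration cost used every four days, with two deleted successors depending on it, is worth storing, while a 1000 GB dataset that costs only $0.5 to regenerate is not.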

We compare the increase and reduction of the system cost rate to decide whether d_0 should be stored or not. The detailed algorithm is shown in Figure 7.

Input: a regenerated intermediate dataset d_0; an IDG;
Output: storage strategy of d_0 and of the stored datasets adjacent to it;
Reduction = gencost(d_0) / d_0.t + Σ_{d_i ∈ d_0.fSet} (gencost(d_0) / d_i.t);  // cost-rate reduction from storing d_0 itself
for ( every stored dataset d_j adjacent to d_0 in the IDG )  // check the adjacent stored datasets (see the Theorem)
    if ( d_j's new generation cost rate, assuming d_0 stored, < d_j.size × CostS )  // d_j is no longer worth storing
        Reduction = Reduction + d_j.size × CostS − d_j's new generation cost rate;  // deleting d_j saves further cost
if ( Reduction > d_0.size × CostS )  // the net reduction outweighs d_0's storage cost rate
    d_0.flag = "stored"; d_0.CostR = d_0.size × CostS;
    for ( every d_i in d_0.fSet )  d_i.CostR = d_i.CostR − gencost(d_0) / d_i.t;
    delete the adjacent datasets identified above and update their cost rates;
update IDG & execute store or delete operations;

Figure 7. Algorithm for checking deleted intermediate datasets

By applying this algorithm for checking the regenerated intermediate datasets, we can not only guarantee that all the datasets kept in the system are necessary to be stored, but also that any change of the datasets' storage status reduces the total system cost.

V. EVALUATION

5.1 Simulation environment and strategies

The intermediate data storage strategy proposed in this paper is generic: it can be used in any scientific workflow application. In this section, we deploy it on the pulsar searching workflow described in Section 2. We use real-world statistics to conduct our simulation on the Swinburne high performance supercomputing facility and demonstrate how our strategy works in storing the intermediate datasets of the pulsar searching workflow. To simulate the cloud computing environment, we set up VMware software (http://www.vmware.com/) on the physical servers and create virtual clusters as the data centre. Furthermore, we set up the Hadoop file system (http://hadoop.apache.org/) in the data centre to manage the application data. In the pulsar example, six intermediate datasets are generated during the workflow execution.
The IDG of this pulsar searching workflow is shown in Figure 8, together with the sizes and generation times of these intermediate datasets. The generation times of the datasets are taken from running this workflow on the Swinburne supercomputer, and for the simulation we assume that the generation times of these intermediate datasets are the same in the cloud system. Furthermore, we assume that the prices of cloud services follow Amazon's cost model, i.e. $0.1 per CPU hour for computation and $0.15 per gigabyte per month for storage. To evaluate the performance of our strategy, we run five simulation strategies together and compare the total cost of the system. The strategies are: 1) store all the intermediate datasets in the system; 2) delete all the intermediate datasets, and regenerate them whenever needed; 3) store the datasets that have a high generation cost; 4) store the datasets that are most often used; and 5) our strategy, which dynamically decides whether a dataset should be stored or deleted.
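Under these prices, the first two baseline strategies reduce to simple closed forms. The sketch below is a rough illustration; the dataset sizes, generation times, and usage intervals are made up, not the pulsar workflow's real figures:

```python
PRICE_CPU_HOUR = 0.10   # Amazon-style computation price, $ per CPU hour
PRICE_GB_MONTH = 0.15   # Amazon-style storage price, $ per GB per month

def store_all_cost(datasets, days):
    """Strategy 1: pure storage cost, a straight line over time."""
    return sum(size * PRICE_GB_MONTH * days / 30
               for size, _gen_hours, _period in datasets)

def store_none_cost(datasets, days):
    """Strategy 2: pure computation cost of regenerating on every use."""
    return sum((days // period) * gen_hours * PRICE_CPU_HOUR
               for _size, gen_hours, period in datasets)

# (size_GB, generation_CPU_hours, used_every_N_days): illustrative values
datasets = [(90, 13, 10), (45, 80, 4), (16, 300, 10)]
```

With these made-up figures, store_all_cost(datasets, 50) is 37.75 while store_none_cost(datasets, 50) is 252.5, showing how far apart the two baselines can be and why a selective strategy sits in between.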

Figure 8. IDG of pulsar searching workflow

5.2 Simulation results

We run the simulations based on the estimated usage rate of every intermediate dataset. From the Swinburne astrophysics research group, we understand that the de-dispersion files are the most useful intermediate dataset: based on these files, many accelerating and seeking methods can be applied to search for pulsar candidates. Hence, we set the de-dispersion files to be used once every 4 days, and the rest of the intermediate datasets to be used once every 10 days. Based on this setting, we run the five simulation strategies mentioned above and calculate the total system cost for ONE branch of the pulsar searching workflow processing ONE piece of observation data over 50 days, which is shown in Figure 9.

[Figure 9 plots the total cost in dollars over the 50 days for the five strategies: store all, store none, store high generation cost datasets, store often used datasets, and the dependency based strategy.]

Figure 9. Total cost of pulsar searching workflow with Amazon's cost model

From Figure 9 we can see that: 1) the cost of the "store all" strategy is a straight line, because in this strategy all the intermediate datasets are stored in the cloud storage, which is charged at a fixed rate, and no computation cost is required; 2) the cost of the "store none" strategy is a fluctuating line, because in this strategy all the costs are computation costs of regenerating intermediate datasets: on days with fewer requests for the data the cost is low, and otherwise the cost is high; 3-5) for the remaining three strategies, the cost lines fluctuate only slightly and the cost is much lower than for the "store all" and "store none" strategies, because the intermediate datasets are partially stored. From Figure 9 we can draw the conclusion that: 1) neither storing all the intermediate datasets nor deleting them all is a cost-effective way to store intermediate data; 2) our dependency based strategy is the most cost-effective way to store the intermediate datasets.
Furthermore, returning to the pulsar searching workflow example, Table 1 shows in detail how the five strategies store the intermediate datasets.

TABLE 1. PULSAR SEARCHING WORKFLOW'S INTERMEDIATE DATASETS' STORAGE STATUS IN 5 STRATEGIES

Strategies                          | Extracted beam | De-dispersion files            | Accelerated de-dispersion files | Seek results | Pulsar candidates | XML files
Store all                           | Stored         | Stored                         | Stored                          | Stored       | Stored            | Stored
Store none                          | Deleted        | Deleted                        | Deleted                         | Deleted      | Deleted           | Deleted
Store high generation cost datasets | Deleted        | Stored                         | Stored                          | Deleted      | Deleted           | Stored
Store often used datasets           | Deleted        | Stored                         | Deleted                         | Deleted      | Deleted           | Deleted
Dependency based strategy           | Deleted        | Stored (was deleted initially) | Deleted                         | Stored       | Deleted           | Stored

Since the intermediate datasets of this pulsar searching workflow are not complicated, we can do some straightforward analysis of how to store them. For the accelerated de-dispersion files, although their generation cost is quite high, compared to their huge size it is not worth storing them in the cloud. However, in the "store high generation cost datasets" strategy, the accelerated de-dispersion files are chosen to be stored. Furthermore, the final XML files are not used very often, but given their high generation cost and small size, they should be stored. However, in the "store often used datasets" strategy, these files are not chosen to be stored. Generally speaking, our dependency based strategy, which is also dynamic, is the most appropriate strategy for intermediate data storage. From Table 1 we can see that our strategy did not store the de-dispersion files at the beginning, but stored them after their regeneration. In our strategy, every storage status change of the datasets reduces the total system cost rate, so the strategy gradually approaches the minimum system cost. One important factor that affects our dependency based strategy is the usage rate of the intermediate datasets. In a system where the usage rate of the intermediate datasets is very high, the generation cost of the datasets is correspondingly high, and these intermediate datasets tend to be stored. On the contrary, in a system with a very low intermediate dataset usage rate, all the datasets tend to be deleted. In the simulation of Figure 9, we set the datasets' usage rates on the borderline that makes the total cost equivalent between the "store all" and "store none" strategies. Under this condition, the intermediate datasets have no tendency to be stored or deleted, which objectively demonstrates our strategy's effectiveness in reducing the system cost. Next, we demonstrate the performance of our strategy under different usage rates of the intermediate datasets.
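This trade-off can be illustrated with the threshold time t_θ = gencost(d_0) / (d_0.size × CostS) from Section 4.2: the number of storage hours that cost as much as one regeneration. The two example datasets below are hypothetical, chosen only to mirror the large-versus-small contrast discussed above:

```python
COST_S = 0.15 / (30 * 24)  # assumed storage price, $ per GB per hour

def threshold_hours(gen_cost_dollars, size_gb):
    """t_theta: hours of storage that cost as much as one regeneration."""
    return gen_cost_dollars / (size_gb * COST_S)

# Bulky dataset ($8 to regenerate, 90 GB): storage spend overtakes one
# regeneration within a few weeks, so frequent use is needed to justify it.
bulky = threshold_hours(8.0, 90.0)   # about 427 hours (~18 days)

# Small but expensive dataset ($8 to regenerate, 0.5 GB): storage stays
# cheaper than a single regeneration for years.
small = threshold_hours(8.0, 0.5)    # 76800 hours (~8.8 years)
```

This mirrors why the bulky accelerated de-dispersion files are not worth keeping while the small but costly XML files are.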
[Figure 10 plots the total cost in dollars over 50 days for the five strategies, with (a) the usage rate of every dataset doubled and (b) the usage rate of every dataset halved.]

Figure 10. Cost of pulsar searching workflow with different intermediate dataset usage rates

Figure 10 (a) shows the system cost with the usage rate of every dataset doubled in the pulsar workflow. From the figure we can see that when the datasets' usage rates are high, the "store none" strategy becomes highly cost ineffective, because the frequent regeneration of the intermediate datasets incurs a very high system cost. In contrast, our strategy is still the most cost-effective one: the total system cost increases only slightly and is not much influenced by the datasets' usage rates. The "store all" strategy, although not influenced by the usage rate, still has a very high cost. The remaining two strategies lie in the middle range; they are influenced more by the datasets' usage rates, and their total costs are higher than our strategy's. Figure 10 (b) shows the system cost with the usage rate of every dataset halved in the pulsar workflow. From this figure we can see that in a system with a low intermediate dataset reuse rate, the "store all" strategy becomes highly cost ineffective, and the "store none" strategy becomes relatively cost effective. Again, our strategy is still the most cost-effective of the five. From all the simulations we have done on the pulsar searching workflow, we find that, depending on the intermediate datasets' usage rates, our strategy reduces the system cost by 46.3%-74.7% in comparison to the "store all" strategy; 45.2%-76.3% in comparison to the "store none" strategy; 23.9%-58.9% in comparison to the "store high generation cost datasets" strategy; and 32.2%-54.7% in comparison to the "store often used datasets" strategy.
Furthermore, to examine the generality of our strategy, we have also conducted many simulations on randomly generated workflows and intermediate data. Due to the space limit, we cannot present them here. Based on the simulations, we can conclude that our intermediate data storage strategy performs well: by automatically selecting the valuable datasets to store, it can significantly reduce the total cost of the pulsar searching workflow.

VI. RELATED WORK

Compared to distributed computing systems like clusters and grids, a cloud computing system has a cost benefit [4]. Assunção et al. [5] demonstrate that cloud computing can extend the capacity of clusters with a cost benefit. Using Amazon clouds' cost model and the BOINC volunteer computing middleware, the work in [16] analyses the cost benefit of cloud computing versus grid computing. The idea of doing science on the cloud is not new, and scientific applications have already been introduced to cloud computing systems. The Cumulus project [22] introduces a scientific cloud architecture for a data centre, and the Nimbus [15] toolkit can directly turn a cluster into a cloud, which has already been used to build clouds for scientific applications. In terms of the cost benefit, the work by Deelman et al. [11] also applies Amazon clouds' cost model and demonstrates that cloud computing offers a cost-effective way to deploy scientific applications. The above works mainly focus on the comparison of cloud computing systems and traditional distributed computing paradigms, showing that applications running on the cloud have cost benefits; our work, in contrast, studies how to reduce the cost when running scientific workflows on the cloud. In [11], Deelman et al. show that storing some popular intermediate data can save cost in comparison to always regenerating it from the input data. In [2], Adams et al. propose a model to represent the trade-off between computation cost and storage cost, but do not give a strategy to find this trade-off. In our paper, an innovative intermediate data storage strategy is developed to reduce the total cost of scientific cloud workflow systems by finding the trade-off between computation cost and storage cost. This strategy automatically selects the most appropriate intermediate data to store, based not only on the generation cost and usage rate, but also on the dependencies of the workflow intermediate data. The study of data provenance is important to our work.
Due to the importance of data provenance in scientific applications, much research on recording the data provenance of a system has been done [13] [6], some of it specifically for scientific workflow systems [6]. Some popular scientific workflow systems, such as Kepler [17], have their own mechanisms to record provenance during workflow execution [3]. In [20], Osterweil et al. present how to generate a Data Derivation Graph (DDG) for the execution of a scientific workflow, where one DDG records the data provenance of one execution. Similar to the DDG, our IDG is also based on scientific workflow data provenance, but it depicts the dependency relationships of all the intermediate data in the system. With the IDG, we know where the intermediate data are derived from and how to regenerate them.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, based on an astrophysics pulsar searching workflow, we have examined the unique features of intermediate data management in scientific cloud workflow systems and developed a novel cost-effective strategy that can automatically and dynamically select the appropriate intermediate datasets of a scientific workflow to store or delete in the cloud. The strategy guarantees that the stored intermediate datasets in the system are all necessary, and it dynamically checks whether regenerated datasets need to be stored and, if so, adjusts the storage strategy accordingly. Simulation results of applying this strategy to the pulsar searching workflow indicate that it can significantly reduce the total cost of the scientific cloud workflow system. Our current work is based on Amazon's cloud cost model and assumes that all the application data are stored in its cloud service. However, scientific workflows sometimes have to run in a distributed manner, since some application data are distributed and may have fixed locations. In these cases, data transfer is inevitable. In the future, we will develop data placement strategies to reduce data transfer among data centres.
Furthermore, for wider utilisation of our strategy, a model for forecasting intermediate data usage rates needs to be studied. It must be flexible enough to adapt to different scientific applications.

ACKNOWLEDGEMENT

The research work reported in this paper is partly supported by the Australian Research Council under Linkage Project LP0990393. We are also grateful for the discussions with Dr. W. van Straten and Ms. L. Levin from the Swinburne Centre for Astrophysics and Supercomputing on the pulsar searching process.

REFERENCES

[1] "Amazon Elastic Computing Cloud, http://aws.amazon.com/ec2/", accessed on 28 Jan. 2010.
[2] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer, "Maximizing Efficiency By Trading Storage for Computation," in Workshop on Hot Topics in Cloud Computing (HotCloud'09), pp. 1-5, 2009.
[3] I. Altintas, O. Barney, and E. Jaeger-Frank, "Provenance Collection Support in the Kepler Scientific Workflow System," in International Provenance and Annotation Workshop, pp. 118-132, 2006.
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," University of California at Berkeley, Technical Report UCB/EECS-2009-28, http://www.eecs.berkeley.edu/pubs/techrpts/2009/eecs-2009-28.pdf, accessed on 28 Jan. 2010.
[5] M. D. d. Assunção, A. d. Costanzo, and R. Buyya, "Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters," in 18th ACM International Symposium on High Performance Distributed Computing, Garching, Germany, pp. 1-10, 2009.
[6] Z. Bao, S. Cohen-Boulakia, S. B. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," in 25th IEEE International Conference on Data Engineering (ICDE '09), pp. 808-819, 2009.
[7] R. Bose and J. Frew, "Lineage Retrieval for Scientific Data Processing: a Survey," ACM Computing Surveys, vol. 37, pp. 1-28, 2005.
[8] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," Future Generation Computer Systems, in press, pp. 1-18, 2009.

[9] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny, "Pegasus: Mapping Scientific Workflows onto the Grid," in European Across Grids Conference, pp. 11-20, 2004.
[10] E. Deelman and A. Chervenak, "Data Management Challenges of Data-Intensive Scientific Workflows," in IEEE International Symposium on Cluster Computing and the Grid, pp. 687-692, 2008.
[11] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The Cost of Doing Science on the Cloud: the Montage Example," in ACM/IEEE Conference on Supercomputing, Austin, Texas, pp. 1-12, 2008.
[12] I. Foster, Z. Yong, I. Raicu, and S. Lu, "Cloud Computing and Grid Computing 360-Degree Compared," in Grid Computing Environments Workshop (GCE '08), pp. 1-10, 2008.
[13] P. Groth and L. Moreau, "Recording Process Documentation for Provenance," IEEE Transactions on Parallel and Distributed Systems, vol. 20, pp. 1246-1259, 2009.
[14] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, and J. Good, "On the Use of Cloud Computing for Scientific Workflows," in 4th IEEE International Conference on e-Science, pp. 640-645, 2008.
[15] K. Keahey, R. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa, "Science Clouds: Early Experiences in Cloud Computing for Scientific Applications," in First Workshop on Cloud Computing and its Applications (CCA'08), pp. 1-6, 2008.
[16] D. Kondo, B. Javadi, P. Malecot, F. Cappello, and D. P. Anderson, "Cost-Benefit Analysis of Cloud Computing versus Desktop Grids," in IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), pp. 1-12, 2009.
[17] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, and E. A. Lee, "Scientific Workflow Management and the Kepler System," Concurrency and Computation: Practice and Experience, pp. 1039-1065, 2005.
[18] C. Moretti, J. Bulosan, D. Thain, and P. J. Flynn, "All-Pairs: An Abstraction for Data-Intensive Cloud Computing," in IEEE International Parallel & Distributed Processing Symposium (IPDPS'08), pp. 1-11, 2008.
[19] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li, "Taverna: A Tool for the Composition and Enactment of Bioinformatics Workflows," Bioinformatics, vol. 20, pp. 3045-3054, 2004.
[20] L. J. Osterweil, L. A. Clarke, A. M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley, "Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance," in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Atlanta, Georgia, pp. 319-329, 2008.
[21] Y. L. Simmhan, B. Plale, and D. Gannon, "A Survey of Data Provenance in e-Science," SIGMOD Record, vol. 34, pp. 31-36, 2005.
[22] L. Wang, J. Tao, M. Kunze, A. C. Castellanos, D. Kramer, and W. Karl, "Scientific Cloud Computing: Early Definition and Experience," in 10th IEEE International Conference on High Performance Computing and Communications (HPCC '08), pp. 825-830, 2008.
[23] A. Weiss, "Computing in the Cloud," ACM networker, vol. 11, pp. 18-25, 2007.