A Data Placement Strategy in Scientific Cloud Workflows


Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Faculty of Information and Communication Technologies, Swinburne University of Technology
Hawthorn, Melbourne, Australia 3122
{dyuan, yyang, xliu, jchen}@swin.edu.au

ABSTRACT

In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. To effectively store these data, a data manager must intelligently select data centres in which these data will reside. This is, however, not the case for data which must have a fixed location. When one task needs several datasets located in different data centres, the movement of large volumes of data becomes a challenge. In this paper, we propose a matrix based k-means clustering strategy for data placement in scientific cloud workflows. The strategy contains two algorithms that group the existing datasets in k data centres during the workflow build-time stage, and dynamically cluster newly generated datasets to the most appropriate data centres, based on dependencies, during the runtime stage. Simulations show that our algorithm can effectively reduce data movement during workflow execution.

Keywords: data management; scientific workflow; cloud computing

1. INTRODUCTION

Running scientific workflow applications usually requires not only high performance computing resources but also massive storage [8]. In many scientific research fields, like astronomy [7], high-energy physics [35] and bio-informatics [39], scientists need to analyse terabytes of data either from existing data resources or collected from physical devices. During these processes, similar amounts of new data might also be generated as intermediate or final products [8]. Workflow technologies are used to automate these scientific applications. Scientific workflows are typically very complex. They usually have a large number of tasks and need a long time for execution.
Nowadays, popular scientific workflows are deployed in grid systems [35] because they have high performance and massive storage. However, building a grid system is extremely expensive and it is not available for scientists all over the world to use. The emergence of cloud computing technologies offers a new way to develop scientific workflow systems. The concept of cloud computing was proposed in late 2007 [47] and it has since been utilised in many areas with some success [8] [5] [] [38]. Cloud computing is deemed the next generation of IT platforms that can deliver computing as a kind of utility []. Foster et al. made a comprehensive comparison of grid computing and cloud computing [3]. Some features of cloud computing also meet the requirements of scientific workflow systems. First, cloud computing systems can provide the high performance and massive storage required for scientific applications in the same way as grid systems, but with a lower infrastructure construction cost among many other features, because cloud computing systems are composed of data centres which can be clusters of commodity hardware. Second, cloud computing systems offer a new paradigm in which scientists from all over the world can collaborate and conduct their research together. Cloud computing systems are based on the Internet, and so are the scientific workflow systems deployed on the cloud. Dispersed computing facilities (like clusters) at different institutions can be viewed as data centres in the cloud computing platform. Scientists can upload their data and launch their applications on scientific cloud workflow systems from anywhere in the world via the Internet. As all the data are managed on the cloud, it is easy to share data among scientists. Research into doing science on the cloud has already commenced, with early experiences such as the Nimbus [3] and Cumulus [46] projects. The work by Deelman et al. [] shows that cloud computing offers a cost-effective solution for data-intensive applications, such as scientific workflows [9].
By taking advantage of cloud computing, scientific workflow systems could gain wider utilisation; however, they will also face some new challenges, of which data management is one. Scientific applications are data intensive and usually need collaborations of scientists from different institutions [6], hence application data in scientific workflows are usually distributed and very large. When one task needs to process data from different data centres, moving data becomes a challenge [8]. Some application data are too large to be moved efficiently, some may have fixed locations from which it is not feasible to move them, and some may have to be located at fixed data centres for processing, but these are only one aspect of this challenge. Even the application data that are flexible to move cannot be moved whenever and wherever we want, since in the cloud computing platform, data centres may belong to different cloud service providers, so that data movement would result in costs. Furthermore, the infrastructure of cloud computing systems is hidden from their users. The systems just offer the computation and storage resources required by users for their applications. The users do not know the exact physical locations where their data are stored. This kind of model is very convenient for users, but poses a big challenge for data management in scientific cloud workflow systems.

In this paper, we propose a matrix based k-means clustering strategy for data placement in scientific cloud workflow systems. Scientific workflows can be very complex: one task might require many datasets for execution; furthermore, one dataset might also be required by many tasks. If some datasets are always used together by many tasks, we say that these datasets are dependent on each other. In our strategy, we try to keep these datasets in one data centre, so that when tasks are scheduled to this data centre, most, if not all, of the data they need are stored locally. Our data placement strategy has two algorithms, one for the build-time stage and one for the runtime stage of scientific workflows. In the build-time stage algorithm, we construct a dependency matrix for all the application data, which represents the dependencies between all the datasets, including the datasets that may have fixed locations. Then we use the BEA algorithm [37] to cluster the matrix and partition it so that the datasets in every partition are highly dependent upon each other. We distribute the partitions into k data centres, with the partitions that contain fixed location datasets placed in the appropriate data centres. These k data centres serve as the initial partitions of the k-means algorithm at the runtime stage. At runtime, our clustering algorithm deals with the newly generated data that will be needed by other tasks. For every newly generated dataset, we calculate its dependencies with all k data centres, and move the data to the data centre that has the highest dependency with it.
By placing data with their dependencies, our strategy attempts to minimise the total data movement during the execution of workflows. Furthermore, by pre-allocating data to other data centres, our strategy can prevent data from gathering in one data centre and reduces the time spent waiting for data by ensuring that relevant data are stored locally.

The remainder of the paper is organised as follows. Section 2 presents the related work. Section 3 gives an example and analyses the research problems. Section 4 introduces the basic strategy of our algorithms. Section 5 presents the detailed steps of the algorithms in our data placement strategy. Section 6 demonstrates the simulation results and the evaluation. Finally, Section 7 addresses our conclusions and future work.

2. RELATED WORK

Data placement of scientific workflows is a very important and challenging issue. In traditional distributed computing systems, much work about data placement has been conducted. In [49], Xie proposed an energy-aware strategy for data placement in RAID-structured storage systems. Stork [33] is a scheduler in the Grid that guarantees that data placement activities can be queued, scheduled, monitored and managed in a fault tolerant manner. In [5], Cope et al. proposed a data placement strategy for urgent computing environments to guarantee data robustness. At the infrastructure level, NUCA [8] is a data placement and replication strategy for distributed caches that can reduce data access latency. However, none of them focuses on reducing data movement between data centres on the Internet.

As cloud computing has become more and more popular, new data management systems have also appeared, such as Google File System [4] and Hadoop [3]. They all have hidden infrastructures that can store the application data independent of users' control. Google File System is designed mainly for Web search applications, which are different from workflow applications. Hadoop is a more general distributed file system, which has been used by many companies, such as Amazon and Facebook.
When you push a file to a Hadoop File System, it will automatically split this file into chunks and randomly distribute these chunks in a cluster. Furthermore, the Cumulus project [46] introduced a scientific cloud architecture for a data centre, and the Nimbus [3] toolkit can directly turn a cluster into a cloud and has already been used to build a cloud for scientific applications. Within a small cluster, data movement is not a big problem, because there are fast connections between nodes, e.g. Ethernet. However, the scientific cloud workflow system is designed for scientists to collaborate, where large scale and distributed applications need to be executed across several data centres. The data movement between data centres may cost a lot of time, since data centres are spread around the Internet with limited bandwidth. In this work, we try to place the application data based on their dependencies in order to reduce the data movement between data centres.

Data transfer is a big overhead for scientific workflows [4]. Though popular scientific workflow systems have their data management strategies, they did not focus on reducing data movement. For the build-time stage, these systems mainly focus on data modelling methods. For example, Kepler [35] has an actor-oriented data modelling method that works for large data in a grid environment, while Taverna [39] and ASKALON [48] have their own process definition languages to represent their data flows. For the runtime stage, most of the scientific workflow systems adopt some data grid systems for their data management. For example, Kepler uses the SRB [7] system, while Pegasus [7] and Triana [4] adopt the RLS system [], and Gridbus [9] has a grid service broker [45] where all data are deemed important resources. Data grids primarily deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer, and modify massive datasets stored in distributed storage resources [44]. However, these systems do not consider the dependencies between data in scientific workflows at either build-time or runtime, and they also cannot reduce data movement.

Some research in grid computing has addressed the importance of data dependency for large-scale scientific applications, although it did not focus on workflow data management. The Filecules project [] groups files based on their dependencies. Using real workload experiment data, the authors demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids. BitDew [] is a distributed data management system for desktop Grids. Different from data centres in the cloud that aim to provide services to users, a desktop Grid aims to make use of the idle computing and storage resources in desktop computers. In BitDew, the data placement dependency is denoted by a data attribute called affinity, which is pre-defined by users. However, in cloud computing, all the application data are hosted in the data centres, where anyone can use the cloud services and upload their data. Letting users define the data dependencies for scientific cloud workflows is clearly impractical.

The closest workflow research to ours is the Pegasus workflow system, which has proposed some data placement strategies [3] [4] based on the RLS system. The strategies are: first, pre-allocate the required data to the computation resource where the task will execute; second, dynamically delete the data that will no longer be used by tasks.
These strategies are only for the runtime stage of scientific workflows and can effectively reduce the overall execution time and the storage usage of the workflows. Furthermore, in [3], the authors proposed a data placement scheduler for distributed computing systems. It guarantees reliable and efficient data transfer with different protocols. These works mainly focus on how to move the application data, and they cannot reduce the total data movement of the whole system. Our work, in contrast, aims to reduce data movement. Our strategy covers both the build-time and runtime stages of scientific workflows, and we design specific algorithms to automatically place and move the application data.

In cloud computing systems, the infrastructure is hidden from users. Hence, for most of the application data, the system will decide where to store them. Dependencies exist among these data. In this paper, we take an initial step in adapting clustering algorithms to data movement based on data dependency. Clustering algorithms have been used in pattern recognition since the 1980s [3], and can classify patterns into groups without supervision. Today they are widely used to process data streams [7]. In many scientific workflow applications, the intermediate data movement is in data stream format and the newly generated data must be moved to the destination in real time. We adapt the k-means clustering algorithm for data placement. When new data are generated by a task, we dynamically calculate the dependencies of the new data with the k data centres, and move the new data to the centre with the highest dependency. The simulation results of this paper show that with our data placement strategy, the data movement between data centres is significantly reduced compared to random data placement.

3. SCIENTIFIC CLOUD WORKFLOW DATA MANAGEMENT

3.1. A Motivating Example

Scientific applications often need to process terabytes of data.
For example, the ATNF Parkes Swinburne Recorder (APSR) [] is a next-generation baseband data recording and processing system currently under development in collaboration by Swinburne University of Technology and ATNF (the Australia Telescope National Facility). The data from the APSR stream at a rate of one gigabyte per second. The researchers at Parkes process the data with a local cluster of servers and do their research. All the data are stored locally at Parkes and they are not available to other institutions. If researchers at other institutions need the data resources from the Parkes Radio Telescope, they have to contact the researchers at Parkes and request the data. Researchers at Parkes will check the local repositories to see if the existing data resources could fulfil the requirements. In this situation communications often suffer from low efficiency because researchers are from different projects and the requirements are usually complex. Sometimes researchers even have to go to Parkes and bring back the data that they need on hard disks. Sharing data resources in this manner is obviously inefficient and hence not desirable.

With cloud computing technologies, we can turn the Parkes cluster into a data centre on a cloud computing platform that can offer services to researchers all over the world. The cloud computing platform is built on the Internet, which is how the data centres are connected to each other. All the data are managed by the cloud data management system. The researchers can access the existing data resources, upload application data and launch their applications via the cloud service. By doing this, the resources at Parkes will be fully utilised, since data can be sent to other data centres for different applications as needed. On the other hand, researchers at Parkes will be able to do more scientific research by retrieving useful data from other data centres around the world. All these data sending and retrieving operations are hidden from the researchers. In other words, via the cloud computing platform, researchers can utilise data resources from other institutions without knowing where the data are physically stored. Hence, on a cloud computing platform, data centres should have the ability to host each other's data. For example, if some particular data at Parkes are frequently retrieved by another data centre, the system will store these data on that data centre instead. Furthermore, if many applications at Parkes need the same data from another data centre, the system will also move those data to Parkes for storage.

The Parkes Radio Telescope was set up in 1961. For over 40 years, the Parkes cluster has accumulated a large amount of data resources in different formats and sizes. Normally, data can be moved to other data centres, but if the size of the data is very large, moving them via the Internet will be inefficient. To transport terabytes of data, the most efficient way is for a delivery company to ship the hard disks [5]. If an application needs the majority of its data from Parkes, it is preferable that it is executed locally and retrieves data from elsewhere. For example, some research projects may need to process the raw data recorded from the telescope by APSR, in order to get some specific results.

3.2. Problem Analysis

Scientific cloud workflows run on the cloud platform, which is composed of many distributed data centres on the Internet (like the Parkes cluster), and each connection between data centres has limited bandwidth. Tasks sometimes need to process more than one dataset, and these datasets may be stored in different data centres.
Because of the bandwidth constraints, the movement of datasets between data centres would be the bottleneck of the system. In [6], the authors proposed a new protocol for data transportation that could provide gigabits of bandwidth. However, it has not been widely supported by the Internet. The popular cloud systems, such as Amazon EC2 [], still have limited bandwidth [34]. It charges $. to $.5 per gigabyte to move data in to and out of Amazon Web Services over the Internet.

Another approach to deal with the bottleneck of large data transfer is to divide the tasks, i.e. for the tasks that need to process many distributed datasets, we split them into many smaller and parallel sub-tasks, and schedule them to different datasets. Map-Reduce technology [6] is a typical and successful paradigm. It has had great success in the Google File System and Hadoop, as well as in scientific applications [36]. However, Map-Reduce is more applicable within one data centre, since it needs huge interconnect bandwidth, for instance in the shuffle step that occurs between the Map procedure and the Reduce procedure. Furthermore, in scientific applications, many tasks must use more than one dataset together and cannot be further divided, such as in the All-Pairs problem [38]. Therefore, data movement is inevitable. In light of this, we have to place the datasets that are needed by the same task in the same data centre as much as possible, so as to minimise data movement when the task is executed.

The placement of datasets among data centres is not trivial. Normally, a cloud computing system needs to decide in which data centres the application data are stored. Most datasets are flexible about where they are stored since they are independent of users. The cloud computing system can automatically store the application data based on some data placement strategies. However, in scientific cloud workflow systems, some data are not so flexible. They have to be stored in particular data centres for different reasons. Some common scenarios are described below.
First, some data may need to be processed by special equipment. In some scientific projects, many special types of equipment are utilised. Some data can only be processed by particular equipment since they are in certain formats, e.g. the signal from the Parkes Radio Telescope can only be processed by the equipment at Parkes, such as the APSR. These data have to be stored where the required equipment is located. Second, some data are naturally distributed and too large to be moved efficiently. For example, the raw data files recorded by APSR are usually terabytes or even petabytes in size. They are naturally stored at Parkes, and impossible to move to other locations via the Internet. Another reason that some data must be placed at a particular data centre is ownership. Data are considered an important and valuable resource in many scientific projects. The cloud computing platform offers a new paradigm for cooperation in which institutions can easily share their valuable data resources by placing a charge on them. So the data with limited access rights have to be stored in particular data centres. No matter what the reason that the data must be stored in a particular data centre, we call these datasets fixed location datasets in general. Correspondingly, we call the datasets for which the system can flexibly decide where to store them flexible location datasets. The data placement strategy not only has to place the flexible location datasets, but also has to take into account the impact of the fixed location datasets.

Some challenges exist in the data placement strategy, as discussed below. First, in scientific workflows, both tasks and datasets could be numerous and make up a complicated many-to-many relationship. One task might need many datasets and one dataset might be needed by many tasks. Furthermore, new datasets will be generated during the workflow execution. One dataset generated by a task might be used by several later tasks. So the data placement strategy should be based on these data dependencies. Second, the scientific cloud workflow system is a dynamic computing environment. Many workflow instances will run in the system simultaneously. Some instances might need a long execution time and some might be short. New workflow instances could be deployed to the system and completed instances could be removed from the system at any time. So the relationships between datasets and tasks will change often and the placement of datasets has to be changed accordingly. Third, the data management in scientific cloud workflow systems is opaque to users, which means users do not know where and how the data are stored. In the cloud environment, users only pay for the computation and storage resources that they need and give the application data to the system for processing. Because cloud systems are built on the service oriented architecture (SOA), the users just use the dynamic cloud services and do not know the infrastructure of the system. Hence, the data placement has to be automatic.

4. BASIC STRATEGIES FOR DATA PLACEMENT

For scientific workflow data management, there are two types of data we have to deal with. The first is the existing data that exist before the workflow execution starts. This type of data mainly includes the resource data from the existing file systems or databases and the application data from users as input for processing or analysis. The second is the generated data that are generated during the workflow execution.
This type of data mainly includes the newly generated intermediate and result data, as well as the streaming data dynamically collected from scientific devices during the workflow execution. We propose this taxonomy because we will treat these two types of data at the workflow build-time and runtime respectively with different algorithms. This taxonomy only indicates the generation time of the datasets. When the generated data move to a data centre and are stored, they become existing data. The most important common feature is that both types of data might be very large. They cannot and should not be stored and moved wherever and whenever we want, since the cloud system has bandwidth constraints. The application data of scientific workflows could also have a variety of formats (e.g. XML data, complex objects, raw data files, tables in relational databases). In this paper, however, we do not consider the structure of the data, since it is not the main focus, and we treat all data in the same way.

In scientific workflows, moving data to one data centre will cost more than scheduling tasks to that centre [3]. Hence, our basic strategy is to have a reasonable placement of data in distributed data centres first, so that when tasks are scheduled to the appropriate data centres, almost all the datasets they need are in local storage. In this work we analyse the dependencies between datasets. Based on this dependency, we adapt the k-means clustering algorithm to cluster datasets to the proper data centres.

In scientific cloud workflow systems, many workflow instances will run simultaneously, each of which has a complex structure. Large numbers of tasks will access large numbers of datasets and produce large output data. In order to execute a task, all required datasets must be located on the same data centre, and this may require some movement of datasets. Furthermore, if two datasets are always used together by many tasks, they should be stored together in order to reduce the frequency of data movement.
Here, we say that these two datasets have a dependency. In other words, two datasets are said to be dependent on each other if they are both used by the same task. The more tasks there are that use the same datasets, the higher the dependency between those datasets. We denote the set of datasets as $D$ and the set of tasks as $T$. To represent this dependency, we give every dataset a task set in addition to its size. So every dataset $d_i \in D$ has two attributes, denoted as $\langle T_i, s_i \rangle$, where $T_i \subset T$ is the set of tasks that will use dataset $d_i$, and $s_i$ denotes the size of $d_i$. Furthermore, we use $dependency_{ij}$ to denote the dependency between datasets $d_i$ and $d_j$. We say that the datasets $d_i$ and $d_j$ have a dependency if there are tasks that will use $d_i$ and $d_j$ together, and the quantity of this dependency is the number of tasks that use both $d_i$ and $d_j$. All the denotations are listed at the end of the paper.
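
As a small illustration of the dependency count just defined, the following Python sketch computes it as the size of the intersection of two task sets. The dataset and task names, and the `task_sets` dictionary, are hypothetical and chosen only for this example; they are not taken from the paper's figures.

```python
from typing import Dict, Set


def dependency(task_sets: Dict[str, Set[str]], di: str, dj: str) -> int:
    """Return the number of tasks that use both dataset di and dataset dj."""
    return len(task_sets[di] & task_sets[dj])


# Hypothetical task sets T_i: the tasks that will use each dataset.
task_sets = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t4"},
}

dep_12 = dependency(task_sets, "d1", "d2")  # d1 and d2 share task t2
dep_13 = dependency(task_sets, "d1", "d3")  # d1 and d3 share no task
print(dep_12, dep_13)
```

Under this definition the measure is symmetric, and the dependency of a dataset with itself is simply the number of tasks that use it.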

$$dependency_{ij} = Count(T_i \cap T_j)$$

In this work, our k-means clustering data placement strategy is based on this dependency, which is used to cluster the datasets into different data centres. The strategy has two stages: build-time and runtime. At the build-time stage, the main goal of the algorithm is to set up k initial partitions for the k-means algorithm. We use a matrix based approach to cluster the existing datasets into k data centres as the initial partitions. At the runtime stage, the main goal of the algorithm is to cluster the newly generated datasets to one of the k data centres based on their dependencies, which will be calculated dynamically.

We have to design different algorithms for the build-time and runtime stages to treat the existing data and generated data respectively, mainly because of the dynamic nature of the cloud environment. Even though we know the size and related tasks of the datasets that will be generated during the workflow execution, it is not practical to calculate their dependencies and assign them a data centre at the build-time stage. This is because scientific workflows have a large number of tasks and need a long time for execution. It is very hard to predict when a certain dataset will be generated in a dynamic cloud environment. If we assign the generated data a data centre at the build-time stage, then when the data are actually generated the data centre might not have enough available storage to store them. Furthermore, it is impractical and inefficient to reserve the storage for the generated data at the build-time stage. This is because the data might not be generated until the end of the scientific workflow and the reserved storage space would be wasted during this time.

5. MATRIX BASED K-MEANS CLUSTERING STRATEGY FOR DATA PLACEMENT

Figure 1. Example of data placement

In this section we discuss our data placement strategy in detail. Fig. 1 gives an example of a simple workflow instance, and it shows the two stages of our strategy.
The data flows in the workflow instance have the following meaning: for example, the flows from dataset $d_1$ to tasks $t_1$ and $t_2$ mean that $d_1$ will be used by both $t_1$ and $t_2$; and the data flows from $t_1$ to $t_2$ and $t_3$ mean that the dataset generated by $t_1$ will be used by both $t_2$ and $t_3$. During the build-time stage, we partition the existing datasets into several partitions, denoted as $p_1, p_2, \dots, p_n$, based on their dependencies, and distribute these partitions into different data centres. During the runtime stage, tasks may retrieve datasets from other data centres as needed, and we also pre-allocate generated datasets to the appropriate data centres.

5.1. Build-Time Stage Algorithm

During the build-time stage, we use a matrix model to represent the existing data. We pre-cluster the datasets by transforming the matrix, and then distribute the datasets to different data centres as the initial partitions for the k-means clustering algorithm, to be used during the runtime stage. The build-time stage algorithm has two steps and the pseudocode is shown in Fig. 4.

Step 1: Set up and cluster the dependency matrix. First, we calculate the data dependencies of all the datasets and build up a dependency matrix DM (Line 3 in Fig. 4), where DM's element $DM_{ij} = dependency_{ij}$. $dependency_{ij}$ is the dependency value between datasets $d_i$ and $d_j$, as defined in the previous section. It can be calculated by counting the tasks in common between the task sets of $d_i$ and $d_j$, which are denoted as $T_i$ and $T_j$. In particular, for the elements on the diagonal of DM, each value is the number of tasks that will use this dataset. In our algorithm, DM is an $n \times n$ symmetric matrix, where $n$ is the total number of existing datasets. If we take the simple workflow instance in Fig. 1 as an example (with only 5 datasets, namely $d_1$ to $d_5$, in the system initially), the dependency matrix DM is shown in Fig. 2.

Figure 2. Build up dependency matrix, with $DM_{ij} = Count(T_i \cap T_j)$

The dependency matrix (i.e. DM) is dynamically maintained at runtime. When new datasets are generated by tasks or added to the system by users, we calculate their dependencies with all the existing datasets and add them to DM.

Next, we use the BEA (Bond Energy Algorithm) to transform the dependency matrix DM (Line 4 in Fig. 4). BEA was proposed in 1972 [37] and has been widely utilised in distributed database systems for the vertical partition of large tables [4]. It is a permutation algorithm that can group similar items together in the matrix by permuting the rows and columns.
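
The construction and runtime maintenance of DM described in Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: datasets are identified by list position, and the function names `build_dm` and `add_dataset` as well as the sample task sets are hypothetical.

```python
from typing import List, Set


def build_dm(task_sets: List[Set[str]]) -> List[List[int]]:
    """Build the n x n dependency matrix; entry (i, j) counts shared tasks."""
    n = len(task_sets)
    return [[len(task_sets[i] & task_sets[j]) for j in range(n)]
            for i in range(n)]


def add_dataset(dm: List[List[int]], task_sets: List[Set[str]],
                new_tasks: Set[str]) -> None:
    """Extend dm and task_sets in place with a newly generated dataset."""
    new_row = [len(new_tasks & tasks) for tasks in task_sets]
    for row, dep in zip(dm, new_row):
        row.append(dep)                      # new column for existing rows
    dm.append(new_row + [len(new_tasks)])    # new row; diagonal = |T_new|
    task_sets.append(new_tasks)


# Hypothetical task sets for three existing datasets.
task_sets = [{"t1", "t2"}, {"t1", "t2", "t3"}, {"t3"}]
dm = build_dm(task_sets)
dm_before = [row[:] for row in dm]

# A dataset generated at runtime that will be used by t3 and t4.
add_dataset(dm, task_sets, {"t3", "t4"})
print(dm_before)
print(dm)
```

Appending a row and a column keeps the matrix symmetric, so only the dependencies of the new dataset need to be computed when it arrives.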
In our work, BEA takes the dependency matrix (DM) as input, and generates a clustered dependency matrix (CM). In CM, the items with similar values are grouped together (i.e. large values with other large values, and small values with other small values). We define a global measure (GM) of the dependency matrix:

$$GM = \sum_{i=1}^{n} \sum_{j=1}^{n} DM_{ij} \left( DM_{i,j-1} + DM_{i,j+1} \right)$$

The permutation is done in such a way as to maximise this measure. The detailed permutation algorithm can be found in [4]. Fig. 3 shows the CM of the example DM after the BEA transformation.

Figure 3. BEA transformation of dependency matrix

In this step, we do not consider the difference between fixed location datasets and flexible location datasets. If there are some fixed location datasets in the system, they will be arbitrarily scattered in the columns and rows of the dependency matrix, since we built up the matrix by calculating dependencies between all the datasets. After the BEA transformation, all the datasets, including the fixed location datasets, are clustered by their dependencies.
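
To make the global measure and the permutation step more concrete, here is a hedged Python sketch. The `global_measure` function follows the GM definition, treating neighbours outside the matrix as 0 (an assumption). The `bea_order` function uses the greedy placement rule commonly given for BEA in distributed database texts (bond and contribution values); the paper defers the details to its reference, so this should be read as one plausible rendering rather than the authors' exact procedure. All function names and the sample matrix are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[int]]


def global_measure(m: Matrix) -> int:
    """GM = sum over i, j of m[i][j] * (m[i][j-1] + m[i][j+1])."""
    n = len(m)
    total = 0
    for i in range(n):
        for j in range(n):
            left = m[i][j - 1] if j > 0 else 0
            right = m[i][j + 1] if j < n - 1 else 0
            total += m[i][j] * (left + right)
    return total


def permute(m: Matrix, order: List[int]) -> Matrix:
    """Apply the same permutation to rows and columns."""
    return [[m[i][j] for j in order] for i in order]


def bea_order(m: Matrix) -> List[int]:
    """Greedy BEA-style ordering: insert each column where it bonds best."""
    n = len(m)

    def bond(a: Optional[int], b: Optional[int]) -> int:
        if a is None or b is None:           # boundary of the ordering
            return 0
        return sum(m[z][a] * m[z][b] for z in range(n))

    order = list(range(min(n, 2)))
    for k in range(2, n):
        best_pos, best_gain = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = 2 * bond(left, k) + 2 * bond(k, right) - 2 * bond(left, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


# Hypothetical DM: datasets 0 and 2 are used together, as are 1 and 3.
dm = [[2, 0, 2, 0],
      [0, 3, 0, 3],
      [2, 0, 2, 0],
      [0, 3, 0, 3]]
order = bea_order(dm)
cm = permute(dm, order)
print(order, global_measure(dm), global_measure(cm))
```

In this toy case the interleaved matrix has a global measure of 0, while the permuted matrix places the two dependent pairs in adjacent rows and columns and scores much higher, which is the clustering effect the text describes.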

Build-time Stage Algorithm
Input:  D: set of existing datasets d_1, d_2, ..., d_n
        DC: set of data centres c_1, c_2, ..., c_m
Output: K: set of data centres with initial datasets
1.  K=Ø; FP=Ø; NFP=Ø;                      //Initialisation. FP: set of partitions that have fixed location datasets
                                           //NFP: set of partitions that have no fixed location datasets
2.  for (every c_i in DC) i_cs_i = cs_i * λ_ini;   //Calculate initial available storage of all data centres
3.  DM = {dependency_ij = Count(T_i ∩ T_j)};       //Step 1: set up DM
4.  CM = BEA(DM);                                  //Step 1: BEA transformation
5.  if (CM contains f)                     //Step 2 starts. Check the existence of fixed location datasets
6.    Partition&Classify(CM)               //Sub-step 1: partition CM and classify the partitions into FP and NFP
7.      if (CM_T contains f & the f belong to different c)
8.        Partition&Classify(CM_T);        //Recursively partition and classify CM_T
9.      else if (CM_T contains f)
10.       add CM_T to FP;                  //CM_T has fixed location datasets, add to FP
11.     else add CM_T to NFP;              //CM_T has no fixed location datasets, add to NFP
12.     if (CM_B contains f & the f belong to different c)
13.       Partition&Classify(CM_B);        //Recursively partition and classify CM_B
14.     else if (CM_B contains f)
15.       add CM_B to FP;                  //CM_B has fixed location datasets, add to FP
16.     else add CM_B to NFP;              //CM_B has no fixed location datasets, add to NFP
17.   for (every data centre c_i in DC)    //Sub-step 2: distribute the partitions with fixed location datasets
18.     if (c_i has f)                     //Choose the data centre c_i that has fixed location datasets
19.       for (every f_j in FD_i)          //Go through all the fixed location datasets belonging to c_i
20.         find CM_j in FP;               //Pick out the partitions that contain these fixed location datasets from FP
21.         add CM_j to P_i;               //Set up the partition set P_i for c_i
22.       calculate ps_i = Σ_{CM_j ∈ P_i} s_j;     //The total size of the partitions in P_i
23.       while (ps_i > i_cs_i)            //Further partition if the size of P_i is too large for c_i
24.         find CM_k in P_i, where s_k = max_{CM_i ∈ P_i} s_i;   //Largest partition in P_i
25.         remove CM_k from P_i;
26.         BinaryPartition(CM_k);         //Partition CM_k and update the partition sets
27.         if (CM_kT contains f) add CM_kT to P_i;
28.         else add CM_kT to NFP;
29.         if (CM_kB contains f) add CM_kB to P_i;
30.         else add CM_kB to NFP;
31.         calculate ps_i = Σ_{CM_j ∈ P_i} s_j;   //New size of P_i after partition
32.       distribute all CM_j in P_i to c_i;       //Distribute datasets
33.       update c_i to K;
34.       i_cs_i = i_cs_i - ps_i;
35. else add CM to NFP;                    //CM does not contain fixed location datasets
36. for (all the partitions CM_i in NFP)   //Sub-step 3: distribute the partitions without fixed location datasets
37.   Partition&Distribute(CM_i)           //Partition and distribute CM_i
38.     if (s_T < max_{j=1..m} cs_j)       //Size of CM_iT is small enough for some data centre
39.       find c_j from DC,                //Find the best data centre
40.         where cs_j = min_{j=1..m}(cs_j > s_T);
41.       distribute CM_iT to c_j;         //Distribute datasets
42.       update c_j to K;
43.       i_cs_j = i_cs_j - s_iT;
44.     else Partition&Distribute(CM_iT);  //Recursively partition and distribute CM_iT
45.     if (s_B < max_{j=1..m} cs_j)       //Size of CM_iB is small enough for some data centre
46.       find c_j from DC,                //Find the best data centre
47.         where cs_j = min_{j=1..m}(cs_j > s_B);
48.       distribute CM_iB to c_j;         //Distribute datasets
49.       update c_j to K;
50.       i_cs_j = i_cs_j - s_iB;
51.     else Partition&Distribute(CM_iB);  //Recursively partition and distribute CM_iB
52. Return K;

Figure 4. Build-time stage algorithm
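As an illustration of Line 3 of the build-time algorithm, the dependency matrix can be built directly from the task sets of the datasets. The following Python sketch is only one possible reading: the function name `build_dependency_matrix` and the example task sets are hypothetical and do not come from SwinDeW-C.

```python
from typing import List, Set


def build_dependency_matrix(task_sets: List[Set[str]]) -> List[List[int]]:
    """Return DM where DM[i][j] is the number of tasks using both dataset i and dataset j.

    task_sets[i] plays the role of T_i, the set of tasks that use dataset d_i.
    The diagonal entry DM[i][i] is simply the number of tasks that use d_i.
    """
    n = len(task_sets)
    return [[len(task_sets[i] & task_sets[j]) for j in range(n)] for i in range(n)]


# Hypothetical example: three datasets and the tasks that use them.
example_task_sets = [{"t1", "t2"}, {"t2", "t3"}, {"t3"}]
dm = build_dependency_matrix(example_task_sets)
```

The matrix is symmetric by construction, which is what the BEA transformation in Line 4 operates on.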

Step 2: Partition and distribute datasets. In this step we distribute the datasets to data centres as the initial k partitions for the k-means clustering algorithm at the runtime stage. We denote the set of data centres as DC. As shown in Fig., we partition the clustered dependency matrix and place the corresponding datasets in different data centres. However, each dataset $d_i$ has a size $s_i$ and each data centre $c_j$ has a storage capacity denoted as $cs_j$. Finding the best partitioning of datasets to match the data centres' storage is an NP-hard problem, since it could be reduced to the Knapsack Packing Problem. Here, we develop a recursive binary partitioning algorithm to find an approximate best solution. First, we partition CM into two parts, $\{d_1, d_2, \ldots, d_p\}$ and $\{d_{p+1}, d_{p+2}, \ldots, d_n\}$, choosing $p$ to maximise the following measurement:

$$PM = \sum_{i=1}^{p}\sum_{j=1}^{p} CM_{ij} \cdot \sum_{i=p+1}^{n}\sum_{j=p+1}^{n} CM_{ij} - \left(\sum_{i=1}^{p}\sum_{j=p+1}^{n} CM_{ij}\right)^{2}$$

This measurement, PM, is large when the datasets in each partition have higher dependencies with each other and lower dependencies with the datasets in the other partition. Based on this measure we can simply calculate PM for $p = 1, 2, \ldots, n-1$ and choose the $p$ with the maximum PM value as the partition point. After one partition, CM forms two new clustered matrices. We denote the top one as $CM_T$, which contains the dependencies of datasets $D_T = \{d_1, d_2, \ldots, d_p\}$, and the bottom one as $CM_B$, which contains the dependencies of datasets $D_B = \{d_{p+1}, d_{p+2}, \ldots, d_n\}$. Every clustered matrix represents a partition of datasets, and we denote the total size of the datasets it contains as $s = \sum_{i=1}^{n} s_i$. Hence the sizes for $CM_T$ and $CM_B$ are $s_T = \sum_{i=1}^{p} s_i$ and $s_B = \sum_{i=p+1}^{n} s_i$ respectively.

Next, we distribute datasets to data centres by recursively partitioning the clustered dependency matrix. For each data centre, we introduce a percentage parameter $\lambda_{ini}$ to denote the initial usage of its storage capacity, which means that the initial size of datasets in data centre $c_i$ cannot exceed $cs_i \cdot \lambda_{ini}$.
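The choice of partition point can be sketched in a few lines of Python. This is a minimal illustration that assumes PM is the product of the two diagonal-block sums minus the squared off-diagonal-block sum; the function names and the example matrix are hypothetical.

```python
from typing import List


def partition_measure(cm: List[List[int]], p: int) -> int:
    """PM for splitting the clustered matrix cm into rows/columns [0, p) and [p, n)."""
    n = len(cm)
    top = sum(cm[i][j] for i in range(p) for j in range(p))            # dependencies inside the top part
    bottom = sum(cm[i][j] for i in range(p, n) for j in range(p, n))   # dependencies inside the bottom part
    cross = sum(cm[i][j] for i in range(p) for j in range(p, n))       # dependencies between the two parts
    return top * bottom - cross ** 2


def best_partition_point(cm: List[List[int]]) -> int:
    """Try every p = 1 .. n-1 and return the one with the largest PM."""
    return max(range(1, len(cm)), key=lambda p: partition_measure(cm, p))


# Hypothetical clustered matrix with two obvious groups: {d1, d2} and {d3, d4}.
example_cm = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 1],
]
split = best_partition_point(example_cm)
```

For this example the split falls between the second and third dataset, where no dependency crosses the boundary.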
The reason we cannot fill the data centres to their maximum storage is that in scientific workflows the generated data can also be very large. We have to reserve sufficient space in data centres to store those data during workflow execution. $\lambda_{ini}$ is an empirical parameter. Its value should depend on what kinds of applications are running on the system, because the generated data of different applications might have different sizes. Furthermore, we also assume that the data centres can host all the application data in the system, i.e. $\sum_{i=1}^{n} s_i < \sum_{i=1}^{m} (cs_i \cdot \lambda_{ini})$.

To distribute the datasets, we have to examine whether there are fixed location datasets in the system (Line 5 in Fig. 4). If the system does not have fixed location datasets (Line 35 in Fig. 4), we recursively partition the sub-matrices $CM_T$ and $CM_B$ until the size of the sub-matrix fits within one of the data centres' initial storage limits ($s \le cs_i \cdot \lambda_{ini}$). Then we distribute the datasets in this sub-matrix to this data centre and add the reference of this data centre ($c_i$) to K, where K is a set of data centres. When the partitioning of CM finishes, all the initial datasets have been moved to proper data centres. We take the data centres in K as the initial partitions of the k-means clustering algorithm.

If there are fixed location datasets in the system, the distribution process is more complicated. For a fixed location dataset $f_i$, we denote it as $\langle T_i, s_i, c \rangle$, where the additional attribute $c$ is the data centre where this dataset has to be stored. We use FD to denote the set of fixed location datasets a data centre has. For a data centre that does not have fixed location datasets, FD is empty. The distribution is conducted in the following three sub-steps.

Sub-step 1 (Lines 6-16 in Fig. 4): we classify fixed location datasets and flexible location datasets into different partitions. We also need to recursively partition the sub-matrices $CM_T$ and $CM_B$.
The stop condition is that the sub-matrix does not have fixed location datasets, or that all the fixed location datasets it has belong to one data centre. We add the partitions that do not have fixed location datasets to a set named NFP and the partitions that have fixed location datasets to a set named FP.

Sub-step 2 (Lines 17-34 in Fig. 4): we distribute the partitions with fixed location datasets in FP. We need to check the data centres' information. For each data centre that has fixed location datasets, we pick out the partitions that contain these fixed location datasets from FP, denoted as P. Then we calculate the total size of these partitions, denoted as ps, where $ps = \sum_{CM_i \in P} s_i$. If these partitions can fit into this data centre, we store them. If not, we repeatedly pick the largest partition from P, binary partition it and move the part that does not have fixed location datasets to NFP, until the remaining partitions fit into the data centre.
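The recursive binary partitioning that this step relies on is easiest to see in the simple case where no dataset has a fixed location. The Python sketch below is an assumption-laden illustration rather than the SwinDeW-C implementation: the helper names, the dictionary `free` (initial available storage per data centre, i.e. the capacity already multiplied by the initial usage parameter) and the example values are all hypothetical, and the sketch checks whether a block fits before splitting it.

```python
from typing import Dict, List


def split_point(cm: List[List[int]], lo: int, hi: int) -> int:
    """Return the p in (lo, hi) that maximises PM for the block covering datasets lo..hi-1."""
    def pm(p: int) -> int:
        top = sum(cm[i][j] for i in range(lo, p) for j in range(lo, p))
        bottom = sum(cm[i][j] for i in range(p, hi) for j in range(p, hi))
        cross = sum(cm[i][j] for i in range(lo, p) for j in range(p, hi))
        return top * bottom - cross ** 2
    return max(range(lo + 1, hi), key=pm)


def distribute(cm: List[List[int]], sizes: List[int], free: Dict[str, int],
               lo: int, hi: int, placement: Dict[int, str]) -> None:
    """Place datasets lo..hi-1 (clustered order), splitting the block until each part fits."""
    total = sum(sizes[lo:hi])
    fitting = [dc for dc in free if free[dc] >= total]
    if fitting:
        best = min(fitting, key=lambda dc: free[dc])  # smallest data centre that still fits
        for i in range(lo, hi):
            placement[i] = best
        free[best] -= total
        return
    if hi - lo == 1:
        raise ValueError("a single dataset does not fit into any data centre")
    p = split_point(cm, lo, hi)
    distribute(cm, sizes, free, lo, p, placement)   # top sub-matrix
    distribute(cm, sizes, free, p, hi, placement)   # bottom sub-matrix


# Hypothetical example: four datasets of size 10 and two data centres.
cm = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 1]]
sizes = [10, 10, 10, 10]
free = {"dc1": 25, "dc2": 20}
placement: Dict[int, str] = {}
distribute(cm, sizes, free, 0, len(cm), placement)
```

In this example the two strongly dependent pairs each end up together in one data centre, and the remaining free storage is updated as in Lines 43 and 50 of Fig. 4.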

Sub-step 3 (Lines 36-52 in Fig. 4): we distribute the partitions in NFP, which contain only flexible location datasets. We start with the largest one and go through all the partitions in NFP in order of size. Every partition is distributed to the data centres by recursive binary partitioning.

5.2. Runtime Stage Algorithm

At the runtime stage, we use the k-means clustering algorithm to dynamically cluster the generated data to one of the k data centres based on their dependencies. When new workflows are deployed to the system or some data centres become overloaded, we also have to adjust the data placement among data centres. The pseudocode of the runtime stage algorithm is shown in Fig. 5.

Some of the generated data could be valuable resources that can be utilised by other workflows, but most of them are temporal data. They are generated by the preceding tasks in the workflows and will be used by the subsequent tasks. They do not need permanent storage and will be deleted after the workflows have finished execution. In many scientific applications, the temporal data are in large volumes [9]. Some research demonstrates that timely removal of these temporal data can save a lot of runtime storage space [4]. In our work, we dynamically check and delete the obsolete temporal data before every round of task scheduling. The runtime stage algorithm contains the following two steps.

Step 1: Data pre-allocation by the clustering algorithm. In this step, the first thing we have to do is task scheduling (Line -3 in Fig. 5). Scheduling is a very important issue in scientific workflow systems, especially for computation intensive and/or data intensive applications. Much research has been done on scheduling workflows [43] [5]. However, task scheduling is not the main focus of this paper, so our scheduling strategy is quite straightforward. We follow the philosophy that moving data to a data centre costs more than scheduling tasks to that centre, and schedule tasks based on the placement of datasets.
We periodically monitor the state of all the workflow tasks and dynamically schedule the ready tasks to the data centre which has the most datasets they require. Here, a task is ready if all the datasets it needs are existing data (i.e. have been generated). When tasks have been executed, new datasets will be generated. The system will then decide where to put these datasets: either store them locally or allocate them to other data centres. In our work, the system clusters the newly generated datasets to the data centre that has the highest dependency with them (Line 4- in Fig. 5). We define the dependency between dataset $d_i$ and data centre $c_j$ as $c\_dep_{ij}$, which is the sum of the dependencies of $d_i$ with all the datasets in $c_j$. Suppose $d_u$ is a newly generated dataset and $T_u$ is the set of tasks that will use $d_u$. First, we calculate the dependencies of $d_u$ with all other datasets in the system and add the new row and column to DM for $d_u$, where

$$DM_{ui} = DM_{iu} = dependency_{ui} = Count(T_u \cap T_i), \quad i = 1, 2, \ldots, n$$

Then we calculate the dependencies of $d_u$ with all the k data centres, where

$$c\_dep_{uj} = \sum_{d_m \in c_j} dependency_{um}, \quad j = 1, 2, \ldots, k$$

With these dependencies, we select the data centre $c_h$ that has the highest dependency with $d_u$, where

$$c\_dep_{uh} = \max_{j=1}^{k} (c\_dep_{uj})$$

$c_h$ is the data centre in which we will store the dataset $d_u$. Before we move $d_u$ to it, we check the available storage of $c_h$. Here we introduce a maximum storage usage parameter $\lambda_{max}$ for data centres, which is a percentage threshold indicating whether a data centre is overloaded or not. $\lambda_{max}$ is also an empirical parameter, just like the initial storage usage parameter $\lambda_{ini}$. Hence, the storage of a data centre $c_i$ that the runtime data can use is $cs_i \cdot (\lambda_{max} - \lambda_{ini})$. The value of $\lambda_{max}$ depends on the overall workload of the system. If the system workload is heavy, $\lambda_{max}$ has to be set to a larger value. Likewise, if the system workload is light, $\lambda_{max}$ is set smaller to prevent too many datasets gathering in one data centre.
We move the newly generated dataset $d_u$ to the selected data centre $c_h$ if $cs_h \lambda + s_u < cs_h \lambda_{max}$ holds, where $s_u$ is the size of $d_u$ and $\lambda$ is the current storage usage percentage of $c_h$. Otherwise, we go to the next step to adjust the data placement.
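The pre-allocation rule of this step can be sketched as follows. The Python below is illustrative only: the function names, the dictionary-based data model and the example values are assumptions, not part of the published algorithm.

```python
from typing import Dict, Set, Tuple


def choose_data_centre(task_sets: Dict[str, Set[str]],
                       placement: Dict[str, str],
                       new_tasks: Set[str]) -> Tuple[str, Dict[str, int]]:
    """Return the data centre with the highest dependency on the new dataset.

    task_sets[d] is the set of tasks using existing dataset d, placement[d] is the
    data centre currently holding d, and new_tasks is T_u for the new dataset.
    """
    c_dep: Dict[str, int] = {}
    for dataset, centre in placement.items():
        shared = len(new_tasks & task_sets[dataset])   # dependency between the new dataset and d
        c_dep[centre] = c_dep.get(centre, 0) + shared  # summed per data centre
    return max(c_dep, key=c_dep.get), c_dep


def can_store(capacity: float, usage: float, size: float, lambda_max: float) -> bool:
    """Storage check: current usage plus the new dataset must stay below the threshold."""
    return capacity * usage + size < capacity * lambda_max


# Hypothetical example with three existing datasets in two data centres.
task_sets = {"d1": {"t1", "t2"}, "d2": {"t2"}, "d3": {"t3"}}
placement = {"d1": "dc1", "d2": "dc1", "d3": "dc2"}
target, scores = choose_data_centre(task_sets, placement, {"t2", "t3"})
```

If `can_store` returns `False` for the chosen data centre, the adjustment step is triggered instead of the move.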

Figure 5. Runtime stage algorithm

Step 2: Adjust data placement among data centres. During workflow execution, there are two situations that trigger the need to adjust the data placement among data centres. The first is when the selected destination data centre $c_h$ for the newly generated dataset does not have enough available storage. This means that $c_h$ is overloaded, so we have to adjust the dataset placement to balance the overall workload of the system. The second is when new workflows are deployed to the system. Together with the new workflows, new datasets and tasks are added to the system. The dependencies of the original datasets will change, since the new tasks might use the existing data in the system. In this situation, we calculate the dependencies between the new datasets and the existing datasets, and add them to the dependency matrix DM. If there are any new tasks which use existing data, they are added to the task set of the appropriate existing dataset. For every new dataset, we find an appropriate data centre by following the procedure in Step 1. If the selected data centre is overloaded, we have to adjust the dataset placement to balance the overall workload of the system.

To adjust the data placement, we need to run some functions from the build-time stage algorithm (Line 5-6 in Fig. 5). First, we do the BEA transformation to cluster the updated dependency matrix (DM) and get a new clustered dependency matrix ($CM'$). Next, we run the algorithm in Step 2 of the build-time stage, but without actual data distribution. We just calculate the new placement of datasets in the data centres and save the references in a new set of data centres, denoted as $K'$. Then we can do the adjustment by comparing the old data placement with the new one in $K'$ (Line 7-4 in Fig. 5).
We start the adjustment from the data centre that has the highest storage load and go through all the data centres in decreasing order of storage usage. For every data centre, we compare the datasets it currently has with the datasets assigned to it in the new placement. Then we send the datasets that no longer belong to this data centre to the ones they now belong to, and retrieve the datasets it should have from other data centres.
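The comparison between the old and the new placement can be sketched as a simple move plan. The Python below is a hypothetical illustration: `plan_adjustment`, its arguments and the example data are assumptions, and actual data transfer is left out.

```python
from typing import Dict, List, Set, Tuple


def plan_adjustment(old: Dict[str, str], new: Dict[str, str],
                    usage: Dict[str, float]) -> List[Tuple[str, str, str]]:
    """Return (dataset, source, destination) moves, most loaded data centre first.

    old and new map each dataset to its data centre before and after re-clustering;
    usage maps each data centre to its current storage usage fraction.
    """
    moves: List[Tuple[str, str, str]] = []
    moved: Set[str] = set()
    for centre in sorted(usage, key=usage.get, reverse=True):
        # First send away datasets that no longer belong here, then retrieve missing ones.
        sends = [d for d in sorted(old) if old[d] == centre and new[d] != centre]
        retrieves = [d for d in sorted(old) if new[d] == centre and old[d] != centre]
        for dataset in sends + retrieves:
            if dataset not in moved:
                moves.append((dataset, old[dataset], new[dataset]))
                moved.add(dataset)
    return moves


# Hypothetical example: dc1 is the most loaded data centre.
old_placement = {"d1": "dc1", "d2": "dc1", "d3": "dc2"}
new_placement = {"d1": "dc1", "d2": "dc2", "d3": "dc1"}
moves = plan_adjustment(old_placement, new_placement, {"dc1": 0.9, "dc2": 0.4})
```

Processing the most loaded data centre first means its outgoing datasets are scheduled before anything is retrieved into it, which matches the ordering described above.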

Since $\lambda_{max}$ represents a percentage of a data centre's total storage space, each data centre will still have some storage available ($100\% - \lambda_{max}$) to facilitate data movement during this redistribution. If $\lambda_{max}$ is set to 100%, additional temporary storage space may need to be acquired to serve as a buffer before the adjustment process can be completed. However, this situation rarely happens in the system, for the following reasons: 1) in the adjustment process we always adjust the data centre with the highest storage usage first, and send its datasets to other data centres first; 2) the total size of the datasets in the system is smaller than the total size of the available storage of all the data centres ($\sum_{i=1}^{n} s_i < \sum_{i=1}^{m} (cs_i \cdot \lambda_{ini})$), because we assume that the data centres can host all the application data in the system; and 3) for every data centre we reserve some storage for the runtime generated datasets ($cs_i (\lambda_{max} - \lambda_{ini})$), and this storage space is not always highly utilised, because we delete obsolete datasets dynamically. In our system, for every data centre, we reserve runtime storage for generated datasets as 4% of the initial storage for existing datasets, i.e. $(\lambda_{max} - \lambda_{ini}) / \lambda_{ini} = 4\%$. As discussed in Section 6, we have run tens of thousands of workflow instances for simulation, and a situation where we lacked storage for data reallocation did not occur.

The data placement strategy in this section means that when a task is scheduled to a data centre during workflow execution, that data centre will hold most of the input datasets for that task. Then only a small number of datasets have to be retrieved from remote data centres. The simulations in the next section show that our data placement strategy can greatly reduce the total data movement during workflow execution.

6. SIMULATION

6.1. Simulation Environment: SwinDeW-C

SwinDeW-C (Swinburne Decentralised Workflow for Cloud) [5] is developed based on SwinDeW [5] and SwinDeW-G [5].
It currently runs at Swinburne University of Technology on an infrastructure composed of servers and high-end PCs. To simulate the cloud computing environment, we set up VMware [4] software on the physical servers and create virtual clusters as data centres. Fig. 6 shows our simulation environment.

Figure 6. Simulation environment of SwinDeW-C

Every data centre created is composed of 8 virtual computing nodes with storage, and we deploy an independent Hadoop file system on each data centre. SwinDeW-C runs on these virtual data centres, which can send and retrieve data to and from each other. Through a user interface at the applications layer, which is a Web based portal, we can deploy workflows and upload application data.

SwinDeW-C is designed for large scale cloud applications. It has a novel architecture for the cloud computing environment. However, the presentation of the comprehensive system design of SwinDeW-C is not the main focus of this paper. In Fig. 7, we only illustrate the key system components of SwinDeW-C that relate to the data placement strategy.

User Interface Module: The cloud computing platform is built on the Internet, and a Web browser is normally the only software needed at the client side. This interface is a Web portal through which users can visit the system and deploy their applications. The Uploading Component is for users to upload application data and workflows, and the Monitoring Component is for users, as well as system administrators, to monitor workflow execution.

Data Management Module: The Data Placement Component is the core component of data management in SwinDeW-C and runs the algorithms in our data placement strategy. The Data Catalogue is used to store the information of applications and, in a service oriented cloud platform, is a registry for the data services. By using the catalogue, the system can locate the data needed. Other components in this module, such as the Data Replication Component, Data Synchronisation Component, Meta-data Repository and Provenance Data Collection, are also essential for cloud data management. Since they are not directly related to the data placement strategy, we do not give their details here.

Other Modules: The Flow Management Module has a Process Repository that stores all the workflow instances running in the system. The Task Management Module has a Scheduler that schedules ready tasks to data centres during the runtime stage of the workflows. Furthermore, the Resource Management Module keeps the information of the data centres' usage, and can trigger the adjustment process in the data placement strategy.
For other components in these modules, as well as other modules in SwinDeW-C, we do not give the details, as the work presented here focuses only on workflow data management.

Figure 7. Related key system components of SwinDeW-C

6.2. Simulation Strategies

The algorithms in our data placement strategy are for the build-time and runtime stages respectively. To evaluate their performance, we run each workflow instance through 4 simulation strategies:

Random: In this simulation, we randomly place the existing data during the build-time stage and store the generated data in the local data centre (i.e. where they were generated) at runtime. This simulation represents the traditional data placement strategies in old distributed computing systems (i.e. clusters and early grid systems). At that time, data were usually stored in the local node or in the nodes that had available storage. The temporal intermediate data, i.e. generated data, were also stored where they were generated, waiting for the tasks to retrieve them.

Build-time only: This simulation shows the performance of our build-time algorithm, which is used to place the existing data at build-time. During the runtime stage we store the generated data in the local data centre, as with the Random simulation. In a cloud computing system, data are more flexible than they were in the past; this allows the system to decide where to store them. Our build-time algorithm places the application data based on their dependencies. This simulation shows the reduction in data movement during workflow execution that results from using this algorithm.

Runtime only: This simulation shows the performance of the runtime algorithm by randomly placing the existing data at build-time and pre-allocating the generated data with our runtime algorithm. This simulation represents the strategy that some popular grid scientific workflows use [3]. Their work shows that pre-allocating data to the computing node where the tasks will execute can reduce the total execution time of the workflow. However, this simulation will show that pre-allocating data only at the runtime stage cannot reduce the data movement in workflow execution.

Build & Run: This simulation shows the overall performance of our algorithms at both build-time and runtime. Our algorithms are specifically designed for scientific cloud workflows. The strategy is based on data dependency and can automatically place existing data and cluster generated data to the appropriate data centres. Comparisons with the other strategies are made from different aspects to show the performance of our algorithms.

The traditional way to evaluate the performance of a workflow system is to record and compare the execution time [3] [4]. However, in our work we count the total data movement instead. The execution time could be influenced by other factors besides data management, such as bandwidth, scheduling strategy and I/O speed. Our data placement strategy aims to reduce the data movement between data centres on the Internet. So we directly take the number of datasets that are actually moved during the workflow execution as the measurement to evaluate the performance of the algorithms. In a cloud computing environment with limited bandwidth based on the Internet, if the total data movement is reduced, the execution time will be reduced correspondingly.
Furthermore, the cost of data transfer will also decrease.

To make the evaluation as objective as possible, we generate test workflows randomly to run on SwinDeW-C. This makes the evaluation results independent of any specific application. As we need to run the build-time and runtime algorithms separately, we set the number of existing datasets and generated datasets to be the same for every test workflow. That means that we have the same number of existing datasets and tasks for every test workflow, and we assume that each task generates only one dataset. We can control the complexity of the test workflow by changing the number of datasets. Every dataset is used by a random number of tasks, and tasks that use generated datasets must be executed after the task that generates their input. We can control the complexity of the relationships between the datasets and tasks by changing the range of this random number. Another factor that has an impact on the algorithms is the number of fixed location datasets. We can randomly choose some percentage of datasets from the existing data and randomly select some data centres for them. We run further simulations to show the impact on performance.

Here we have only included graphs of the simulation results. The detailed configuration and result reports of the simulations, as well as the source code, can all be found online.

6.3. Simulation Results

Fig. 8 shows the data movement when we run workflows with different complexity on different numbers of data centres. We can see the increases in data movement as the workflows become more complex and the number of data centres increases. All the values in the figure are the average of running test workflows with the same parameters. In Fig. 8 (a), we ran the test workflows with different complexity on 5 data centres. We used 4 types of test workflows with different numbers of datasets. In Fig. 8 (b), we fixed the test workflows' dataset count to 5, and ran them on different numbers of data centres.
Then we changed % of the input datasets to fixed location datasets and ran the same simulation again. The results are shown in Fig. 9. From the results, we can draw the conclusions that 1) the build-time algorithm can effectively reduce the total data movement of the workflow execution; 2) the runtime algorithm does not reduce the total data movement, and even causes more data movement if the existing datasets are placed randomly; and 3) with fixed location datasets added to the system, our algorithms still work very well, with performance degrading only slightly. The runtime algorithm does not decrease the data movement because it pre-allocates datasets before scheduling tasks based on their data dependencies. If the existing datasets are randomly placed, the differing dependencies of the data centres are not obvious. The increase in data movement is caused by pre-allocation of datasets to the wrong data centres. However, if the existing datasets are clustered by the build-time algorithm, the performance of the runtime algorithm is better.
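The random test workflows described in the simulation strategies could be generated along the following lines. This Python sketch is a guess at one workable generator, not the code used in the simulations: the function `generate_workflow`, its parameters and the naming of datasets are all hypothetical.

```python
import random
from typing import Dict, Optional, Set


def generate_workflow(n: int, min_use: int, max_use: int,
                      seed: Optional[int] = None) -> Dict[str, Set[int]]:
    """Return a mapping from dataset name to the set of task indices that use it.

    There are n existing datasets and n tasks; task t generates exactly one dataset.
    A generated dataset may only be used by tasks with a larger index, so every
    consumer runs after the task that produces its input.
    Assumes 0 <= min_use <= max_use and min_use <= n.
    """
    rng = random.Random(seed)
    tasks = list(range(n))
    uses: Dict[str, Set[int]] = {}
    for i in range(n):
        k = rng.randint(min_use, min(max_use, n))          # how many tasks use this existing dataset
        uses[f"existing_{i}"] = set(rng.sample(tasks, k))
    for t in range(n):
        later = tasks[t + 1:]                               # only later tasks may consume the output of t
        k = min(rng.randint(min_use, max_use), len(later))
        uses[f"generated_{t}"] = set(rng.sample(later, k))
    return uses


workflow = generate_workflow(n=10, min_use=1, max_use=3, seed=0)
```

Widening the range between `min_use` and `max_use` makes the dataset-task relationships denser, which is the complexity control mentioned in the text.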

Figure 8. Data movements without runtime storage limit and without fixed location datasets (series: Random, BuildtimeOnly, RuntimeOnly, Build&Run; x-axes: Data Sets, Data Centres)

Figure 9. Data movements without runtime storage limit and with % of fixed location datasets (series: Random, BuildtimeOnly, RuntimeOnly, Build&Run; x-axes: Data Sets, Data Centres)

However, in the simulation described above, we did not limit the amount of storage that the data centres had available during runtime. The reason for this is that we wanted to see how the tasks and datasets were distributed, which indicates the workload balance among data centres. During the execution of every test workflow instance, we recorded the number of datasets that moved to each data centre, as well as the tasks that were scheduled to that data centre. We also calculated the standard deviation of the data centres' usage. Fig. 10 shows the average standard deviation of running test workflows on 5 data centres, each having 8 existing datasets and 8 tasks, both with and without fixed location datasets.

From Fig. 10 we can see relatively high deviations in the data centres' usage in the two simulations without the runtime algorithm. This means that tasks and datasets are allocated to one data centre more frequently, which leads to that data centre becoming a super node with a high workload. By contrast, in the other two simulations, which use the runtime algorithm to pre-allocate the generated data to other data centres, the deviation of data centre usage is low. This demonstrates that the runtime algorithm produces a more balanced distribution of the workload among data centres.

In a cloud computing environment, data centres normally have limited storage, especially in some storage constrained systems. When one data centre is overloaded, we need to reallocate the data to other data centres. The reallocation not only causes extra data movement, but also delays the execution of the workflow. To count the reallocated datasets, we ran the same test workflows as in Fig. 10 with a storage limit in every data centre.
We limited the runtime storage for generated datasets to 4% of the initial storage for existing datasets, i.e. $(\lambda_{max} - \lambda_{ini}) / \lambda_{ini} = 4\%$. In Fig. 11 we show the average data movement including the data reallocation.

Figure 10. Standard deviation of workload among data centres (panels: Datasets Movement, Tasks Scheduling; y-axis: Standard Deviation; series: Random, BuildtimeOnly, RuntimeOnly, Build&Run)

Figure 11. Proportions of 3 types of data movements: (a) without fixed location data, (b) with % fixed location data (categories: Data Retrieved, Data Sent, Data Reallocated; series: Random, BuildtimeOnly, RuntimeOnly, Build&Run)

From Fig. 11, we can see that a lot of data is reallocated in the simulations without the runtime algorithm. The least data reallocation occurred when we only used the runtime algorithm. However, the least data movement in total occurred when using the build-time and runtime algorithms together. In Fig. 11 (a), using both algorithms caused movements of datasets on average. Comparing this to the Random simulation, datasets movements on average, our algorithms reduce the data movement by 5.8%. On the other hand, the build-time algorithm and runtime algorithm caused movement of 7.6 and datasets on average. Compared to the Random simulation, they reduce the data movements by 4.8% and 4.% respectively. In Fig. 11 (b), with % fixed location datasets in the system, our algorithms (Build&Run) can reduce the data movement by 47.4% compared to the Random simulation.

To better evaluate the performance of our algorithms, we gave every data centre a runtime storage limit and ran the same simulation workflows as in Fig. 8. The final results of data movement are shown in Fig. 12. From Fig. 12 we can see that as the number of data centres and datasets increases, the performance of the build-time algorithm decreases. This is because without the runtime algorithm the datasets and tasks gather on one data centre. This triggers the adjustment process more frequently, which costs extra data movements.

Furthermore, we ran the same simulation as in Fig. 12 under the condition that the system has fixed location datasets. Fig. 13 shows the data movements when we set the percentage of fixed location datasets to %. We can see that our algorithms can still reduce the data movements significantly. Furthermore, with higher percentages of fixed location datasets in the system, our algorithms still work, and we demonstrate this in the next simulation.

Figure 12. Data movements with runtime storage limit (series: Random, BuildtimeOnly, RuntimeOnly, Build&Run; x-axes: (a) Data Sets, (b) Data Centres)

Figure 13. Data movements with runtime storage limit and with % fixed location datasets (series: Random, BuildtimeOnly, RuntimeOnly, Build&Run; x-axes: Data Sets, Data Centres)

Fig. 13 is consistent with Fig. 9 in showing that the fixed location datasets have a negative impact on the algorithms' performance. In the algorithms, we try to place the datasets on data centres based on dependencies; however, the fixed location datasets have to be stored in particular data centres. This decreases performance, as fixed location datasets prevent the algorithms from placing datasets with their dependencies. However, given the existence of fixed location datasets, our algorithms can still reduce data movement by placing the flexible location datasets with their dependencies.

To demonstrate the impact of fixed location datasets on the algorithms, we conducted another batch of simulations. We ran test workflows on 5 data centres, each having 8 existing datasets and 8 tasks, but with different percentages of fixed location datasets. As the number of fixed location datasets increases, we can see their impact on data movement in Fig. 14.

From Fig. 14 (a) we can see that as the percentage of fixed location datasets goes up, the data movements of the Build-time only and Build & Run simulations go up accordingly; however, the Random and Runtime only simulations stay steady. This means the fixed location datasets primarily have an impact on the build-time algorithm. This is because all the fixed location datasets are existing data, which are placed by the build-time stage algorithm. When the percentage reaches 6%, the data movement of the Build & Run simulation even exceeds that of the Random simulation. This is because the pre-allocation of datasets in the runtime algorithm causes more data movements as the build-time algorithm gets worse. In Fig. 14 (b) it may seem slightly confusing that the data movements of all simulations go up and then drop as the percentage of fixed location datasets goes up. This is because when we set the runtime storage limit, many data movements are caused by data reallocation. However, the fixed location datasets are not involved in the overload adjustment process. Hence, the data movement decreases. In this figure we can also see that the fixed location datasets may have a negative impact on the build-time algorithm.

Figure 14. Data movements with different percentages of fixed location datasets (series: Random, BuildtimeOnly, RuntimeOnly, Build&Run; x-axis: Percentage of Fixed Datasets)

7. CONCLUSIONS AND FUTURE WORK

In this paper, we examined the unique features of scientific cloud workflows and proposed a clustering data placement strategy that can automatically allocate application data among data centres based on dependencies. Simulations in our cloud workflow system SwinDeW-C indicate that our data placement strategy can effectively reduce data movement during workflow execution. The build-time algorithm reduces the amount of data retrieved, and the runtime algorithm guarantees a balanced distribution of data and can reduce data movement incurred by data reallocation, even when fixed location data exist in the system.

In our current work, to guarantee data reliability, we use Hadoop's replication mechanism within a data centre, and among data centres we did not use any replication strategies. The data used in scientific workflow applications are usually very large, and as such it is not efficient to replicate all the application data in the system. However, replication of frequently used data could also reduce data movement. In future work, we will develop efficient replication strategies for the data placement algorithm, which could balance the data movement and storage usage. Furthermore, in our current simulation we measure the reduction of dataset movements to evaluate our strategy. In the future, we will measure the execution time of the workflow as well, which can better demonstrate the effectiveness of our strategy.
To be more comprehensive, we will also incorporate the size of datasets when calculating the data dependency, and adapt some popular cloud service providers' pricing models to our simulation, which will show the cost effectiveness of our strategy.

ACKNOWLEDGMENT

The research work reported in this paper is partly supported by the Australian Research Council under Linkage Project LP. We are grateful to Bryce Gibson and Michael Jensen for the accomplishment of the simulation work, as well as the careful English proofreading.

REFERENCES

[1] "Amazon Elastic Computing Cloud", accessed on 5 November 2009.
[2] "ATNF Parkes Swinburne Recorder", accessed on 5 November 2009.
[3] "Hadoop", accessed on 5 November 2009.
[4] "VMware", accessed on 5 November 2009.
[5] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," University of California at Berkeley, Technical Report UCB/EECS-2009-28, accessed on 5 November 2009.

19 [6] R. Barga an D. Gannon, "Scientific versus Business Workflows," in Workflows for e-science, pp. 9-6, 7. [7] C. Baru, R. Moore, A. Rajasekar, an M. Wan, "The SDSC Storage Resource Broker," in IBM Centre for Avance Stuies Conference, Toronto, Canaa pp. -, 998. [8] M. Brantner, D. Florescuy, D. Graf, D. Kossmann, an T. Kraska, "Builing a Database on S3," in SIGMOD, Vancouver, BC, Canaa, pp. 5-63, 8. [9] R. Buyya an S. Venugopal, "The Gribus Toolkit for Service Oriente Gri an Utility Computing: An Overview an Status Report," in IEEE International Workshop on Gri Economics an Business Moels, Seoul, pp. 9-66, 4. [] R. Buyya, C. S. Yeo, an S. Venugopal, "Market-Oriente Clou Computing: Vision, Hype, an Reality for Delivering IT Services as Computing Utilities," in th IEEE International Conference on High Performance Computing an Communications (HPCC-8), Los Alamitos, CA, USA, 8. [] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, an I. Branic, "Clou computing an emerging IT platforms: Vision, hype, an reality for elivering computing as the 5th utility," Future Generation Computer Systems, vol. in press, pp. -8, 9. [] A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, an B. Tierney, "Giggle: A Framework for Constructing Scalable Replica Location Services," in ACM/IEEE conference on Supercomputing, Baltimore, Marylan, pp. -7,. [3] A. Chervenak, E. Deelman, M. Livny, M.-H. Su, R. Schuler, S. Bharathi, G. Mehta, an K. Vahi, "Data Placement for Scientific Applications in Distribute Environments," in 8th Gri Computing Conference, pp , 7. [4] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shiels, I. Taylor, an I. Wang, "Programming scientific an istribute workflow with Triana services," Concurrency an Computation: Practice an Experience, vol. 8, pp. -37, 6. [5] J. M. Cope, N. Trebon, H. M. Tufo, an P. 
Beckman, "Robust ata placement in urgent computing environments," in IEEE International Symposium on Parallel & Distribute Processing, IPDPS 9, pp. - 3, 9. [6] J. Dean an S. Ghemawat, "MapReuce: simplifie ata processing on large clusters," Commun. ACM, vol. 5, pp. 7-3, 8. [7] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, an M. Livny, "Pegasus: Mapping Scientific Workflows onto the Gri," in European Across Gris Conference, pp. -, 4. [8] E. Deelman an A. Chervenak, "Data Management Challenges of Data-Intensive Scientific Workflows," in IEEE International Symposium on Cluster Computing an the Gri, pp , 8. [9] E. Deelman, D. Gannon, M. Shiels, an I. Taylor, "Workflows an e-science: An overview of workflow system features an capabilities," Future Generation Computer Systems, vol. In Press, Correcte Proof. [] E. Deelman, G. Singh, M. Livny, B. Berriman, an J. Goo, "The Cost of Doing Science on the Clou: the Montage example," in ACM/IEEE Conference on Supercomputing, Austin, Texas, pp. -, 8. [] S. Doraimani an A. Iamnitchi, "File grouping for scientific ata management: lessons from experimenting with real traces," in Proceeings of the 7th international symposium on High performance istribute computing Boston, MA, USA: ACM, 8, pp [] G. Feak, H. He, an F. Cappello, "BitDew: a programmable environment for large-scale ata management an istribution," in Proceeings of the 8 ACM/IEEE conference on Supercomputing, Austin, Texas, pp. -, 8. [3] I. Foster, Z. Yong, I. Raicu, an S. Lu, "Clou Computing an Gri Computing 36-Degree Compare," in Gri Computing Environments Workshop, GCE '8, pp. -, 8. [4] S. Ghemawat, H. Gobioff, an S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, pp. 9-43, 3. [5] R. Grossman an Y. Gu, "Data Mining Using High Performance Data Clous: Experimental Stuies Using Sector an Sphere," in SIGKDD, pp. 9-97, 8. [6] R. Grossman, Y. Gu, M. Sabala, an W. 
Zhang, "Compute an storage clous using wie area high performance networks," Future Generation Computer Systems, pp , 8. [7] S. Guha, A. Meyerson, N. Mishra, R. Motwani, an L. O'Callaghan, "Clustering ata streams: Theory an practice," IEEE Transactions on Knowlege an Data Engineering, vol. 5, pp , 3. [8] N. Haravellas, M. Ferman, B. Falsafi, an A. Ailamaki, "Reactive NUCA: near-optimal block placement an replication in istribute caches," in Proceeings of the 36th annual International Symposium on Computer Architecture, ISCA '9, Austin, TX, USA, pp , 9. [9] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, an J. Goo, "On the Use of Clou Computing for Scientific Workflows," in 4th IEEE International Conference on e-science, pp , 8. [3] A. K. Jain, M. N. Murty, an P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 3, pp , 999. [3] K. Keahey, R. Figueireo, J. Fortes, T. Freeman, an M. Tsugawa, "Science Clous: Early Experiences in Clou Computing for Scientific Applications," in First Workshop on Clou Computing an its Applications (CCA'8), pp. -6, 8.

20 [3] T. Kosar an M. Livny, "A framework for reliable an efficient ata placement in istribute computing systems," Journal of Parallel an Distribute Computing, vol. 65, pp , 5. [33] T. Kosar an M. Livny, "Stork: making ata placement a first class citizen in the gri," in Proceeings of 4th International Conference on Distribute Computing Systems, ICDCS 4, pp , 4. [34] H. Liu an D. Orban, "GriBatch: Clou Computing for Large-Scale Data-Intensive Batch Applications," in Eighth IEEE International Symposium on Cluster Computing an the Gri, pp , 8. [35] B. Luascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, an E. A. Lee, "Scientific workflow management an the Kepler system," Concurrency an Computation: Practice an Experience, pp , 5. [36] A. Matsunaga, M. Tsugawa, an J. Fortes, "ClouBLAST: Combining MapReuce an Virtualization on Distribute Resources for Bioinformatics Applications," in 4th IEEE International Conference on e-science, pp. -9, 8. [37] W. T. McCormick, P. J. Sehweitzer, an T. W. White, "Problem Decomposition an Data Reorganization by a Clustering Technique," Operations Research, vol., pp , 97. [38] C. Moretti, J. Bulosan, D. Thain, an P. J. Flynn, "All-Pairs: An Abstraction for Data-Intensive Clou Computing," in IEEE International Parallel & Distribute Processing Symposium, IPDPS'8, pp. -, 8. [39] T. Oinn, M. Ais, J. Ferris, D. Marvin, M. Senger, M. Greenwoo, T. Carver, K. Glover, M. R. Pocock, A. Wipat, an P. Li, "Taverna: A tool for the composition an enactment of bioinformatics workflows," Bioinformatics, vol., pp , 4. [4] M. T. Ozsu an P. Valuriez, Principles of istribute atabase systems: Prentice-Hall, Inc. Upper Sale River, NJ, USA, 99. [4] R. Proan an T. Fahringer, "Overhea Analysis of Scientific Workflows in Gri Environments," IEEE Transactions on Parallel an Distribute Systems, vol. 9, pp , 8. [4] G. Singh, K. Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H. Zhao, R. Sakellariou, K. Blackburn, D. Brown, S. Fairhurst, D. Meyers, G. B. 
Berriman, J. Goo, an D. S. Katz, "Optimizing Workflow Data Footprint," Scientific Programming, vol. 5, pp , 7. [43] S. Venugopal an R. Buyya, "An SCP-base heuristic approach for scheuling istribute ata-intensive applications on global gris," J. Parallel Distrib. Comput., vol. 68, pp , 8. [44] S. Venugopal, R. Buyya, an K. Ramamohanarao, "A Taxonomy of Data Gris for Distribute Data Sharing, Management, an Processing," ACM Comput. Surv., vol. 38, pp. -53, 6. [45] S. Venugopal, R. Buyya, an L. Winton, "A Gri Service Broker for Scheuling Distribute Data-Oriente Applications on Global Gris," in n Workshop on Mileware in Gri Computing, Toronto, Canaa, pp. 75-8, 4. [46] L. Wang, J. Tao, M. Kunze, A. C. Castellanos, D. Kramer, an W. Karl, "Scientific Clou Computing: Early Definition an Experience," in th IEEE International Conference on High Performance Computing an Communications, HPCC '8., pp , 8. [47] A. Weiss, "Computing in the Clou," ACM Networker, vol., pp. 8-5, 7. [48] M. Wieczorek, R. Proan, an T. Fahringer, "Scheuling of Scientific Workflows in the ASKALON Gri Environment," SIGMOD Recor, vol. 34, pp. 56-6, 5. [49] T. Xie, "SEA: A Striping-Base Energy-Aware Strategy for Data Placement in RAID-Structure Storage Systems," IEEE Transactions on Computers, vol. 57, pp , 8. [5] J. Yan, Y. Yang, an G. K. Raikunalia, "SwinDeW - A PP-Base Decentralize Workflow Management System," IEEE Transactions on Systems, Man an Cybernetics, Part A, vol. 36, pp , 6. [5] Y. Yang, K. Liu, J. Chen, J. Lignier, an H. Jin, "Peer-to-Peer Base Gri Workflow Runtime Environment of SwinDeW-G," in IEEE International Conference on e-science an Gri Computing, pp. 5-58, 7. [5] Y. Yang, K. Liu, J. Chen, X. Liu, D. Yuan, an H. Jin, "An Algorithm in SwinDeW-C for Scheuling Transaction-Intensive Cost-Constraine Clou Workflows," in 4th IEEE International Conference on e- Science, pp , 8.

DENOTATIONS:
d_i: dataset
D: set of datasets
D_i: set of datasets in a partition
fd_i: fixed location dataset
FD: set of fixed location datasets
t_i: workflow task
T: set of workflow tasks
T_i: set of workflow tasks that will use dataset d_i
dc_i: data centre
DC: set of data centres
p_i: partition of datasets
P: set of partitions
s_i: size of a dataset
cs: size of a data centre
s: size of a partition
ps: size of a set of partitions
FP: set of partitions that have fixed location datasets
NFP: set of partitions that do not have fixed location datasets
DM: dependency matrix
CM: clustered dependency matrix
CM_i: sub clustered dependency matrix
CM_T: the top sub clustered dependency matrix after one binary partition
CM_B: the bottom sub clustered dependency matrix after one binary partition
GM: global measure of BEA transformation
PM: global measure of binary partition
dep_ij: dependency between datasets d_i and d_j
dc_dep_ij: dependency between dataset d_i and data centre dc_j
K: set of data centres with placement of datasets
λ_ini: initial storage usage parameter of data centres
λ_max: maximum storage usage parameter of data centres
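As a rough illustration of how the dependency entries in the list above might be computed, the following Python sketch assumes that the dependency between two datasets is the number of workflow tasks that use both of them, and that the dependency between a dataset and a data centre is the sum of its dependencies with the datasets already placed in that centre. Both definitions are assumptions made for this sketch, and the function names and the toy task sets are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def dependency_matrix(task_sets: Dict[str, Set[str]]) -> List[List[int]]:
    """Build a symmetric dependency matrix over datasets.

    task_sets maps each dataset name to the set of tasks that use it.
    Assumption: the dependency of two datasets is the count of shared tasks.
    Rows and columns follow the sorted order of the dataset names.
    """
    names = sorted(task_sets)
    return [
        [len(task_sets[a] & task_sets[b]) for b in names]
        for a in names
    ]


def dependency_to_centre(
    dataset: str,
    centre_datasets: Iterable[str],
    task_sets: Dict[str, Set[str]],
) -> int:
    """Dependency between one dataset and one data centre.

    Assumption: this is the sum of the dataset's dependencies with every
    dataset currently stored in the centre.
    """
    return sum(
        len(task_sets[dataset] & task_sets[other])
        for other in centre_datasets
    )


# Hypothetical toy workflow: three datasets and the tasks that use them.
example_task_sets = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t4"},
}
example_matrix = dependency_matrix(example_task_sets)
example_centre_dependency = dependency_to_centre(
    "d1", ["d2", "d3"], example_task_sets
)
```

Under these assumptions, a newly generated dataset would be a candidate for the data centre with which it has the highest dependency, subject to the storage limits of that centre.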