DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS
Simon Ostermann, Radu Prodan and Thomas Fahringer
Institute of Computer Science, University of Innsbruck
Technikerstrasse 21a, Innsbruck, Austria
simon@dps.uibk.ac.at
OVERVIEW
Introduction
Optimized Cloud Provisioning
  Cloud Start
  Instance Size
  Grid Scheduling
  Cloud Stop
Evaluation using 3 scientific workflows: Wien2k, Invmod, MeteoAG
Conclusion
INTRODUCTION
Infrastructure as a Service: a branch of Cloud computing
On-demand resources, e.g. Amazon EC2, GoGrid, ...
Other common Cloud computing areas not covered:
  Platform as a Service
  Software as a Service
  Specialized solutions for storage, web hosting, ...
CLOUD COMPUTING FOR SCIENTIFIC COMPUTING?
Rent resources instead of buying own hardware
Eliminates permanent operation, maintenance, and depreciation costs
Scale an infrastructure up/down based on temporary, immediate needs
Significantly reduces over-provisioning
Virtualised resources enable scalable deployment and provisioning of application software
Reliability through business SLA relationships that bind actors to offer higher QoS guarantees
CLOUD MODELS
Cloud computing is mostly billed on an hourly basis; some research papers assume a finer granularity
[Figure: Cloud instance lifecycle with example durations — Unallocated, Requested, Starting, Running/Accessible (usable interval), Shutting down, Terminated]
Interesting problems arise:
  How much of the paid full hour do I use?
  How can I maximize the usage / minimize the cost?
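The hourly billing question can be made concrete with a small sketch (hypothetical helper names; it only assumes per-started-hour billing as offered by e.g. Amazon EC2 at the time):

```python
import math

HOUR = 3600  # instances are billed per started hour

def billed_hours(used_seconds):
    """Hours charged for an instance that was accessible for used_seconds."""
    return max(1, math.ceil(used_seconds / HOUR))

def utilization(used_seconds):
    """Fraction of the paid time actually used."""
    return used_seconds / (billed_hours(used_seconds) * HOUR)

# An instance used for 70 minutes is billed 2 full hours:
print(billed_hours(70 * 60))           # 2
print(round(utilization(70 * 60), 2))  # 0.58
```

Maximizing usage therefore means either filling the remainder of a started hour with further work, or finishing close to an hour boundary.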
GRID COMPUTING
The Grid has emerged as a worldwide shared distributed platform for solving large-scale scientific problems
Grid computing with additional Cloud resources to speed up scientific computing
Just-in-time scheduler from ASKALON, a workflow execution system for Grid and Cloud resources
ASKALON is a workflow system developed by the DPS group at the University of Innsbruck
Multiple scientific workflows from different fields of science
GROUDSIM
Grid and Cloud simulator
Event-based for scalability reasons
Experiments showed up to 90% better performance and better scalability than GridSim
Java-based, to allow integration into existing software
Simulation allows wide analysis of Cloud usage without expenses
Simulation results match real executions
GROUDSIM ARCHITECTURE
[Figure: GroudSim architecture — an event-based simulation engine (put events in list / get next event) drives the infrastructure and application simulation; failure generation, job submission, file transfers, and callbacks connect the simulated Grid and Cloud entities]
OPTIMIZED CLOUD PROVISIONING
Analysis of regular executions and the resulting costs
The analysis revealed multiple parts needing optimization
Choices have to be made about the start and stop of resources and the number of instances requested
Four optimizations were found, defined as algorithms (in the paper) and exploited in the evaluation
CLOUD START
Parallel Grid regions with more tasks than available cores
Depending on Cloud and Grid speed, serialization and imbalance overheads are analyzed
When minimization of the runtime of the parallel section is possible, Cloud resources are started
[Figure: Gantt charts of a parallel region on three Grid cores, without and with an additional Cloud core]
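The start criterion above — launch Cloud instances only when they shorten the parallel section — can be sketched as follows (hypothetical function names and a simplified uniform-task model, not the paper's algorithm):

```python
import math

def grid_makespan(n_tasks, grid_cores, grid_task_time):
    # Surplus tasks serialize into additional waves on the Grid cores.
    return math.ceil(n_tasks / grid_cores) * grid_task_time

def should_start_cloud(n_tasks, grid_cores, grid_task_time,
                       cloud_cores, cloud_task_time, startup_overhead):
    """Start Cloud instances only if offloading the surplus tasks
    shortens the parallel region."""
    if n_tasks <= grid_cores:
        return False  # no serialization on the Grid, nothing to gain
    surplus = n_tasks - grid_cores
    cloud_finish = startup_overhead + math.ceil(surplus / cloud_cores) * cloud_task_time
    combined = max(grid_task_time, cloud_finish)
    return combined < grid_makespan(n_tasks, grid_cores, grid_task_time)

# 4 tasks of 120s on 3 Grid cores need two waves (240s); a slow Cloud
# core needing 250s for the fourth task would not help:
print(should_start_cloud(4, 3, 120, 1, 250, 0))   # False
# A faster Cloud core (100s per task, 20s startup) finishes at 120s:
print(should_start_cloud(4, 3, 120, 1, 100, 20))  # True
```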
INSTANCE SIZE
Instances may offer different numbers of cores
When only part of the Cloud cores are used, the cost efficiency is lower
Getting too few cores may result in serialization / no benefit
It is important to decide whether the number of instances to request is rounded up or down, resulting in 2 behaviors:
  generous: better performance but more expensive
  economical: less expensive but performance may not improve
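The two rounding behaviors amount to a ceiling versus a floor on the instance count (a minimal sketch with a hypothetical helper name):

```python
import math

def instances_to_request(tasks, cores_per_instance, policy):
    """Number of instances to request for `tasks` parallel tasks."""
    exact = tasks / cores_per_instance
    if policy == "generous":
        return math.ceil(exact)   # enough cores for every task, some may idle
    return math.floor(exact)      # economical: no idle cores, surplus tasks serialize

# 10 tasks on 8-core instances:
print(instances_to_request(10, 8, "generous"))    # 2 (6 cores idle)
print(instances_to_request(10, 8, "economical"))  # 1 (2 tasks serialize)
```

Note that the economical policy may request no instance at all when the tasks fit on less than one instance's cores.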
GRID SCHEDULING
The Grid is a dynamic, shared environment
Resources may become available while the workflow execution uses Cloud resources
Rescheduling jobs back to the Grid might save cost / might decrease execution time
Decisions are made depending on the work already completed by a job mapped to a Cloud resource and the speed difference between Grid and Cloud
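A simplified sketch of that rescheduling decision (hypothetical names; the slide's two criteria reduced to remaining work, relative speeds, and the time left in the already-paid hour):

```python
HOUR = 3600

def should_reschedule_to_grid(remaining_work, cloud_speed, grid_speed,
                              seconds_left_in_paid_hour):
    """Move a job from a Cloud instance to a newly freed Grid resource
    if that decreases execution time or avoids a further paid hour."""
    time_on_cloud = remaining_work / cloud_speed
    time_on_grid = remaining_work / grid_speed
    saves_time = time_on_grid < time_on_cloud
    saves_cost = time_on_cloud > seconds_left_in_paid_hour  # would need another hour
    return saves_time or saves_cost

# A fast Grid node halves the remaining time -> reschedule:
print(should_reschedule_to_grid(1200, 1.0, 2.0, HOUR))  # True
# A slow Grid node, and the Cloud job fits in the paid hour -> keep it:
print(should_reschedule_to_grid(1200, 1.0, 0.5, HOUR))  # False
```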
CLOUD STOP
Unused resources are shut down to save money
A shutdown after 5 minutes of a paid hour is as expensive as one after 58 minutes
Resources might be reused in the upcoming 53 minutes, and this reuse reduces the overall Cloud provisioning overheads
The shutdown time falls within the paid period, therefore the point in time has to be chosen knowing the shutdown duration of the Cloud
In some cases one hour of Cloud time can be saved
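Since any point within a paid hour costs the same, an idle instance can be kept accessible for reuse until the last moment at which a shutdown still completes inside the paid hour. A sketch of that deadline computation (hypothetical helper names):

```python
HOUR = 3600

def shutdown_trigger_time(start_time, now, shutdown_duration):
    """Latest moment to trigger a shutdown so that the instance
    terminates within the hour already paid for."""
    hours_paid = (now - start_time) // HOUR + 1
    paid_until = start_time + hours_paid * HOUR
    return paid_until - shutdown_duration

def should_shut_down(start_time, now, shutdown_duration, idle):
    # An idle instance stays accessible for reuse until just before
    # the paid hour runs out; shutting it down earlier saves nothing.
    return idle and now >= shutdown_trigger_time(start_time, now, shutdown_duration)

# With a 2-minute shutdown duration, keep an idle instance until second 3480:
print(should_shut_down(0, 1000, 120, idle=True))  # False (wait for reuse)
print(should_shut_down(0, 3480, 120, idle=True))  # True  (last safe moment)
```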
EVALUATION Three different scientific workflows with different levels of parallelism Execution simulated using GroudSim Impact of different optimizations on the three workflows when using 3 different types of Cloud resources and 3 Clusters from the Austrian Grid
METRIC
Comparison of executions on Grid resources and executions using Grid plus additional on-demand Cloud resources
We define a new metric CT called cost per unit of saved time ($/T)
It represents how much a unit of saved execution time costs, under the assumption that Grid resources are freely available
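The CT metric follows directly from the two execution times and the Cloud bill (a sketch with hypothetical names; it assumes, as stated above, that the Grid resources themselves are free):

```python
def cost_per_saved_time(grid_time, grid_plus_cloud_time, cloud_cost):
    """CT ($/T): Cloud cost per unit of execution time saved,
    with Grid resources assumed to be free."""
    saved = grid_time - grid_plus_cloud_time
    if saved <= 0:
        return float("inf")  # the Cloud brought no speedup
    return cloud_cost / saved

# Saving 10 hours of execution for $100 of Cloud time costs 10 $/hour saved:
print(cost_per_saved_time(30, 20, 100))  # 10.0
```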
WORKFLOWS
From different fields of science, with different structures
Parallelisation size x: a factor representing the number of tasks in a workflow, evaluated for values from 1 to 900
Computationally intensive; data transfers are a small part of each workflow
Cloud network speed and storage influence kept low
Simulation data based on real executions in the Austrian Grid
GENERAL OBSERVATIONS
[Figure: Cost over parallelisation size (0–900) for Grid+m1.small, Grid+m1.large and Grid+c1.xlarge, each with and without the Cloud stop optimization]
Comparison of regular and optimized executions of differently sized workflows
WIEN2K
Vienna University of Technology
Theoretical chemistry (materials science)
Electronic structure calculations for solids using density functional theory
Number of activities: 2 * x + 3, x = parallelisation size
WIEN2K
[Figures: execution times and cost on the Grid and with additional Cloud resources (m1.small, m1.large, c1.xlarge), and cost per unit of saved time (CT, $/T) on a logarithmic scale, over parallelisation size 0–900]
INVMOD
A hydrological application using the Levenberg-Marquardt algorithm to minimize the error between simulation and measurements
Number of activities: 12 * x + 1, x = parallelisation size
INVMOD
[Figures: execution times and cost on the Grid and with additional Cloud resources (m1.small, m1.large, c1.xlarge), and cost per unit of saved time (CT, $/T) on a logarithmic scale, over parallelisation size 50–300]
METEOAG
Meteorology and Geophysics Institute
Meteorological simulations with the numerical model RAMS
Resolves alpine watersheds and thunderstorms in the Arlberg region of western Austria
Number of activities: 69 * x + 2, x = parallelisation size
[Figure: MeteoAG workflow — simulation_init, then for each case: case_init, rams_makevfile (initial conditions), rams_init (6 h simulation), revu_compare (post-process), raver (verify and select); if continued: rams_hist (18 h simulation) and revu_dump (post-process); finally stageout]
METEOAG
[Figures: execution times and cost on the Grid and with additional Cloud resources (m1.small, m1.large, c1.xlarge), and cost per unit of saved time (CT, $/T) on a logarithmic scale, over parallelisation size 50–300]
CONCLUSION
The granularity of Cloud payment plays an important role in Cloud allocation decisions
Optimizations like the ones presented are needed to allow efficient usage of this dynamic resource class
The longer Cloud resources are needed, the lower the impact of the optimizations
A future extension with full graph scheduling algorithms is planned
THANK YOU Any questions?