Grid and Cloud Scheduling and Resource Management
Dick Epema, Parallel and Distributed Systems Group
ASCI a24
Outline
- Introduction
- Resource Management in Grids versus Clouds
- Condor Flocking in Grids
- Processor Co-allocation
- Cycle Scavenging
- Performance Variability in Clouds
Grids and Clouds
- Grids: large, multi-organizational collections of heterogeneous machines (desktops, clusters, supercomputers); best-effort, no cost model
- Clouds: large, uni-organizational, homogeneous data centers; weak SLAs, cost model
Grids versus Clouds
- heterogeneous vs. homogeneous
- many types of systems vs. datacenters
- multi-organizational vs. clearly defined organization
- scientific applications vs. web applications
- authentication issues vs. authentication by credit card
- one user, many resources vs. elasticity
- no industry involvement vs. much industry involvement
- free (?) vs. payment model
- energy awareness in both
Job types in Grids and Clouds
- Sequential jobs/bags-of-tasks: e.g., parameter sweep applications (PSAs); these account for the vast majority of jobs in grids; leads to high-throughput computing
- Parallel jobs: extension of high-performance computing; in most grids, only a small fraction
- Workflows: multi-stage data filtering, etc.; represented by directed (acyclic) graphs
- MapReduce applications (in clouds)
- Miscellaneous: interactive simulations; data-intensive applications (also in clouds); web applications (in clouds)
A General Resource Management Framework (1)
A General Resource Management Framework (2)
- Local resource management system (resource level): the basic resource management unit; provides low-level resource allocation and scheduling; e.g., PBS, SGE, LSF
- Global resource management system (grid level): coordinates the local resource management systems within multiple or distributed sites; provides high-level functionality to use resources efficiently:
  - job submission
  - resource discovery and selection
  - authentication, authorization, accounting
  - scheduling
  - job monitoring
  - fault tolerance
How to select resources in the grid?
Grid scheduling is the process of assigning jobs to grid resources.
A grid resource broker/scheduler receives jobs and dispatches them to local resource managers, e.g., of clusters (space-shared allocation) or single CPUs (time-shared allocation).
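The dispatching step described above can be illustrated with a toy broker. This is a minimal sketch with hypothetical names; a real broker negotiates with autonomous local schedulers instead of holding authoritative state about them.

```python
# Toy grid resource broker: pick a local resource manager (LRM) with enough
# free CPUs for the job. All class and function names are illustrative.

class LocalResourceManager:
    def __init__(self, name, cpus):
        self.name = name
        self.free_cpus = cpus

    def accept(self, job):
        self.free_cpus -= job["cpus"]

def broker_submit(job, managers):
    """Select an LRM with enough free CPUs; return its name, or None."""
    candidates = [m for m in managers if m.free_cpus >= job["cpus"]]
    if not candidates:
        return None                      # no site can run the job right now
    best = max(candidates, key=lambda m: m.free_cpus)   # most idle capacity
    best.accept(job)
    return best.name
```

In a real grid the broker's view of `free_cpus` is only approximate, which is one of the scheduling problems listed later in these slides.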
Resource Characteristics in Grids (1)
- Autonomous: each resource has its own management policy or scheduling mechanism; no central control over the multi-organizational sets of resources
- Heterogeneous: hardware (processor architectures, disks, network); basic software (OS, libraries); grid software (middleware); systems management (security set-up, runtime limits)
Resource Characteristics in Grids (2)
- Size: large numbers of nodes, providers, and consumers; large amounts of data
- Varying availability: resources can join or leave the grid at any time due to maintenance, policy reasons, and failures
- Insecure and unreliable environment: prone to various types of attacks
Problems in Grid Scheduling (1)
1. Grid schedulers do not own resources themselves: they have to negotiate with autonomous local schedulers; authentication/multi-organizational issues
2. Grid schedulers have to interface with different local schedulers: some may have support for reservations, others are queuing-based; some may support checkpointing, migration, etc.
3. Structure of applications: many different structures (parallel, PSAs, workflows, database, etc.); need for application adaptation modules (support keeps improving)
Problems in Grid Scheduling (2)
4. Lack of a reservation mechanism: and with such a mechanism, we also need good runtime estimates
5. Heterogeneity: see above
6. Failures: monitor the progress of applications and the sanity of systems; the only thing we know to do upon failures is restart (possibly from a checkpoint)
7. Performance metric: turn-around time; cost
8. Reproducibility of performance experiments
Example Grids: The Globus Toolkit
- Widely used grid middleware to negotiate and control access to resources; GRAM component for control of local execution
- The Globus Toolkit includes software services and libraries for: resource monitoring; a resource specification language; resource discovery and management; security; file management (GridFTP)
- Was the dominant grid middleware in the period 1997-2006
- Main designers: Ian Foster (Chicago) and Carl Kesselman (LA)
- Achievement award talk by Ian Foster at HPDC'12 in Delft, online at www.hpdc.org/2012
- www.globus.org
Condor (1/8): batch queuing system + grids
Condor is a high-throughput scheduling system that:
- started around 1986 as one of many batch queuing systems for clusters (of desktop machines), and has survived!
- supports cycle scavenging: using idle time on clusters of machines
- introduced the notions of matchmaking and classads
- provides remote system calls, a queuing mechanism, scheduling policies, a priority scheme, and resource monitoring
- was initiated and is still being developed by Miron Livny, Madison, Wisc.
http://research.cs.wisc.edu/condor
D.H.J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Future Generation Computer Systems, Vol. 12, pp. 53-65, 1996.
Condor (2/8): matchmaking
Basic operation of Condor:
1a. jobs send classads to the matchmaker
1b. machines send classads to the matchmaker
1c. the matchmaker matches jobs and machines
1d. and notifies the submission machine,
2a. which starts a shadow process that represents the remote job on the execution machine
2b/c. and contacts the execution machine
3b/c. on the execution machine, the actual remote user job is started
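The matchmaking step (1a-1c) can be sketched as two-sided predicate evaluation. This is a toy illustration, not the real ClassAd language: actual ClassAds are declarative expressions with ranking, while here each ad's Requirements is just a Python predicate over the other side's attributes.

```python
# Toy ClassAd-style matchmaking: a job and a machine match when each side's
# Requirements predicate accepts the other side's ad.

def matches(job_ad, machine_ad):
    return (job_ad["Requirements"](machine_ad)
            and machine_ad["Requirements"](job_ad))

def matchmake(job_ads, machine_ads):
    """Greedily pair each job with the first still-free matching machine;
    returns a list of (job owner, machine name) pairs."""
    free = list(machine_ads)
    pairs = []
    for job in job_ads:
        for m in free:
            if matches(job, m):
                pairs.append((job["Owner"], m["Name"]))
                free.remove(m)
                break
    return pairs
```

The key design point this preserves is symmetry: resources constrain jobs just as jobs constrain resources, which is what lets machine owners express usage policies.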
Condor (3/8): combining pools (design 1)
Federation with a Flock Central Manager (FCM).
Disadvantages:
- who maintains the FCM?
- the FCM is a single point of failure
- the matchmakers have to be modified
Condor (4/8): combining pools (design 2)
Protocol between Central Managers.
Disadvantage: the matchmaker has to be modified.
Condor (5/8): combining pools (design 3)
Connecting pools network-style with GateWays (GWs): Condor flocking.
Advantages:
- transparent to the matchmaker
- no component to be maintained by a third party
- in the GWs, any access policies and resource sharing policies can be implemented
Condor (6/8): flocking
The submission machine and the matchmaker (MM) in the submission pool operate as before; the GateWay presents itself as a machine of the other pool; in the execution pool, the GateWay presents the remote job as if it were its own, and the execution machine operates as before.
Conclusions:
- The design considerations for Condor flocking are still very valid when joining systems (cloud federations?)
- A nice, clear, transparent research solution that was too complex in practice
Condor (7/8): user flocking
Conclusion: Condor and Condor flocking have survived 20 years of grid computing!
Experimentation: DAS-4
- A system purely for CS research, operational since Oct. 2010
- Specs: 1,600 cores (quad-core 2.4 GHz CPUs); accelerators (GPUs); 180 TB storage; 10 Gb/s InfiniBand; Gb Ethernet
- Clusters: VU (148 CPUs), UvA/MultimediaN (72), UvA (32), TU Delft (64), Leiden (32), Astron (46)
- Connected by SURFnet6 10 Gb/s lambdas
KOALA: a co-allocating grid scheduler
Original goals:
1. processor co-allocation: parallel applications
2. data co-allocation: job affinity based on data locations
3. load sharing: in the absence of co-allocation
all while being transparent to the local schedulers.
Additional goals:
- research vehicle for grid and cloud research
- support for (other) popular application types
KOALA is written in Java, is middleware-independent (initially Globus-based), and has been deployed on the DAS-2 through the DAS-4 since September 2005.
KOALA: the runners
The KOALA runners are adaptation modules for different application types:
- set up communication / name server / environment
- launch applications + perform application-level scheduling
- scheduling policies
Current runners:
- CSRunner: for cycle-scavenging applications (PSAs)
- IRunner: for applications using the Ibis Java library
- MRunner: for malleable parallel applications
- OMRunner: for co-allocated parallel OpenMPI applications
- WRunner: for co-allocated workflows
- MR-Runner: for MapReduce applications (under construction)
Co-Allocation (1)
In grids, jobs may use resources in multiple sites: co-allocation or multi-site operation.
Reasons:
- to benefit from available resources (e.g., processors)
- to access and/or process geographically spread data
- application characteristics (e.g., simulation in one location, visualization in another)
Resource possession in different sites can be:
- simultaneous (e.g., parallel applications)
- coordinated (e.g., workflows)
With co-allocation:
- a more difficult resource-discovery process
- a single global job
- the need to coordinate allocations by autonomous resource managers
Co-allocation (2): job types
- fixed job: the placement of the job components over the clusters is fixed
- non-fixed job: the scheduler decides on the placement of the components
- flexible job: same total job size, but the scheduler decides on the split-up into components and on their placement
- total job: a single cluster of the same total size (no co-allocation)
Co-allocation (3): slowdown
Co-allocated applications are less efficient due to the relatively slow wide-area communications.
Slowdown of a job: the execution time on the multicluster divided by the execution time on a single cluster (usually > 1).
Processor co-allocation is a trade-off between faster access to more capacity and higher utilization on the one hand, and shorter execution times on the other.
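The slowdown metric translates directly into code. The 1.20 benefit threshold in the helper below is the rule of thumb from the simulation conclusions later in these slides, used here only as a default:

```python
# Slowdown of a co-allocated job, as defined above.

def slowdown(t_multicluster, t_single_cluster):
    """Execution time on the multicluster over that on a single cluster."""
    return t_multicluster / t_single_cluster

def coallocation_beneficial(t_multi, t_single, threshold=1.20):
    # 1.20 is the approximate break-even slowdown from the trace-based
    # simulations cited in these slides
    return slowdown(t_multi, t_single) <= threshold
```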
Co-allocation (4): scheduling policies
Placement policies dictate where the components of a job go.
Placement policies for non-fixed jobs:
1. load-aware: Worst Fit (WF) (balance the load across clusters)
2. input-file-location-aware: Close-to-Files (CF) (reduce file-transfer times)
3. communication-aware: Cluster Minimization (CM) (reduce the number of wide-area messages)
Placement policies for flexible jobs:
1. communication-aware: Flexible Cluster Minimization (FCM) (CM for flexible jobs)
2. network-aware: Communication-Aware (CA) (take latency into account)
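As an illustration of the first policy, here is a minimal Worst Fit placement sketch for a non-fixed job (the data representation is an assumption; KOALA's actual implementation is of course richer):

```python
# Worst Fit (WF) placement: each job component goes to the cluster with the
# most idle processors, which balances the load across clusters.

def worst_fit(components, idle):
    """components: processor counts per component (placed largest-first);
    idle: dict mapping cluster name -> idle processors.
    Returns the list of chosen clusters (for the sorted components),
    or None if the job cannot be placed."""
    idle = dict(idle)                                 # work on a copy
    placement = []
    for need in sorted(components, reverse=True):
        cluster = max(idle, key=idle.get)             # most idle = worst fit
        if idle[cluster] < need:
            return None
        idle[cluster] -= need
        placement.append(cluster)
    return placement
```

Cluster Minimization would instead greedily fill the largest clusters so that fewer clusters, and hence fewer wide-area links, are used.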
Co-allocation (5): simulations/analysis
The model has a host of parameters. Main conclusions:
- Co-allocation is beneficial when the slowdown is at most about 1.20
- Unlimited co-allocation is no good: limit the number of job components and the maximum job-component size
- Give local jobs some, but not absolute, priority over global jobs
- Mathematical analysis for the maximal utilization
See, e.g.:
1. A.I.D. Bucur and D.H.J. Epema, "Trace-Based Simulations of Processor Co-Allocation Policies in Multiclusters," IEEE/ACM High Performance Distributed Computing (HPDC), 2003.
2. A.I.D. Bucur and D.H.J. Epema, "Scheduling Policies for Processor Co-Allocation in Multicluster Systems," IEEE Trans. on Parallel and Distributed Systems, Vol. 18, pp. 958-972, 2007.
Co-Allocation (6): Experiments on the DAS-3
[Figure: average execution time and average job response time (s) versus the number of clusters combined, for the FCM and CA policies, with and without the poorly connected Delft cluster.]
O.O. Sonmez, H.H. Mohamed, and D.H.J. Epema, "On the Benefit of Processor Co-Allocation in Multicluster Grid Systems," IEEE Trans. on Parallel and Distributed Systems, Vol. 21, pp. 778-789, 2010.
Cycle Scavenging (1): System Requirements
1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
2. Fairness: multiple cycle-scavenging applications running concurrently should be assigned comparable CPU time
3. Dynamic resource allocation: cycle-scavenging applications have to grow/shrink at runtime
4. Efficiency: as much use of dynamic resources as possible
5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with
Cycle Scavenging (2): Policies
1. Equipartition-All (grid-wide basis): the idle processors across the whole grid are divided evenly over the cycle-scavenging (CS) users
2. Equipartition-PerSite (per-cluster basis): the idle processors within each cluster are divided evenly over the CS users
In both cases, the CS users share the clusters with non-CS jobs.
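The share computation behind the two policies can be sketched in a few lines (a simplification: remainders are truncated here, and a real scheduler recomputes shares whenever resources or CS users come and go):

```python
# Equipartition of idle processors over cycle-scavenging (CS) users.

def equi_all(idle_per_cluster, n_users):
    """Grid-wide equipartition: total idle nodes divided over the CS users."""
    total = sum(idle_per_cluster.values())
    return total // n_users          # nodes per CS user, anywhere in the grid

def equi_per_site(idle_per_cluster, n_users):
    """Per-cluster equipartition: each cluster's idle nodes divided locally."""
    return {c: idle // n_users for c, idle in idle_per_cluster.items()}
```

Equi-PerSite never needs a global view of idle resources, which is one reason it behaves better in the experiments on the next slide.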
Cycle Scavenging (3): Experimental Results
[Figure: number of completed tasks under Equi-All and Equi-PerSite, for the WBlock and WBurst workloads.]
Equi-PerSite is fair and superior to Equi-All.
O.O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup, and D.H.J. Epema, "Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems," 9th IEEE/ACM Int'l Symposium on Cluster Computing and the Grid (CCGrid09), May 2009.
Research Questions
A. Iosup, M.N. Yigitbasi, and D.H.J. Epema, "On the Performance Variability of Production Cloud Services," 11th IEEE/ACM Int'l Symp. on Cluster, Cloud, and Grid Computing (CCGrid), pp. 104-113, 2011.
Outline
1. Introduction and Motivation
2. Method
3. Q1: Analysis of performance variability
   1. Analysis of the AWS Dataset
   2. Analysis of the GAE Dataset
4. Q2: Impact of variability on large-scale applications
   1. Grid & PPE Job Execution
   2. Selling Virtual Goods in Social Networks
   3. Game Status Maintenance for Social Games
Production Cloud Services
Amazon Web Services (AWS), IaaS:
- EC2 (Elastic Compute Cloud)
- S3 (Simple Storage Service)
- SQS (Simple Queueing Service)
- SDB (Simple Database)
- FPS (Flexible Payment Service)
Google App Engine (GAE), PaaS:
- Run (Python/Java runtime)
- Datastore (database)
- Memcache (caching)
- URL Fetch (web crawling)
Our Method (1): Performance Traces
CloudStatus (www.cloudstatus.com, now defunct):
- real-time values and weekly averages for most of the AWS and GAE services
- values obtained with periodic performance probes (sampling period under 2 minutes)
- data obtained from the CloudStatus website using web crawls
Our Method (2): Analysis
1. Data selection: select several months' worth of data to see whether a performance metric is highly variable
2. Find the characteristics of the variability:
   - basic statistics: the five quartiles (Q0-Q4), including the median (Q2), the mean, and the standard deviation
   - derivative statistic: the IQR (Q3-Q1)
   - e.g., a coefficient of variation > 1.1 indicates high variability
3. Analyze the time patterns of the performance variability:
   - investigate for each performance metric the presence of daily/weekly/monthly/yearly time patterns
   - e.g., for monthly patterns, divide the dataset into twelve subsets, and for each subset compute the statistics and plot them for visual inspection
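Step 2 of the analysis maps directly onto the Python standard library. A sketch of the statistics computed per metric (the dict keys are my naming, not the paper's):

```python
# Variability characteristics of one performance metric: the five quartiles,
# the IQR, and the coefficient of variation (std dev / mean).

import statistics

def variability(samples):
    # statistics.quantiles with n=4 returns Q1, Q2 (median), Q3,
    # using its default 'exclusive' interpolation method
    q1, q2, q3 = statistics.quantiles(samples, n=4)
    return {
        "min": min(samples), "Q1": q1, "median": q2, "Q3": q3,
        "max": max(samples),
        "IQR": q3 - q1,
        "CoV": statistics.stdev(samples) / statistics.mean(samples),
    }

def highly_variable(samples, threshold=1.1):
    # the slides use CoV > 1.1 as a rule of thumb for high variability
    return variability(samples)["CoV"] > threshold
```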
Our Method (3): Is Variability Present?
Use statistics or visual inspection. E.g., for EC2:

min    Q1     median   Q3     max     IQR   mean    std dev
57.0   73.6   75.7     78.5   122.1   4.9   76.62   5.17
AWS Dataset (1): EC2
Deployment latency [s]: the time it takes to start a small instance, from the startup to the time the instance is available.
- Higher IQR and range in weeks 41-52, due to the increasing EC2 user base?
- Has an impact on applications using EC2 auto-scaling
AWS Dataset (2): S3
GET throughput [bytes/s]: the estimated rate at which an object in a bucket is read.
- The last five months of the year exhibit a much lower IQR and range
- More stable performance for the last five months, possibly due to software/infrastructure upgrades
AWS Dataset (3): SQS
Average lag time [s]: the time it takes for a posted message to become available to read (averaged over multiple queues).
- Long periods of stability (low IQR and range)
- But periods of high performance variability also exist
GAE Dataset (1): Run Service
Fibonacci [ms]: the time it takes to calculate the 27th Fibonacci number.
- Highly variable performance until September
- The last three months have stable performance (low IQR and range)
GAE Dataset (2): Datastore
Read latency [s]: the time it takes to read a User Group.
- A yearly pattern from January to August?
- The last four months of the year exhibit a much lower IQR and range
GAE Dataset (3): Memcache
PUT [ms]: the time it takes to put 1 MB of data in Memcache.
- The median performance per month shows an increasing trend over the first 10 months
- The last three months of the year exhibit stable performance
Experimental Setup (1): Simulations
Trace-based simulations for three applications.
Input:
- Grid Workloads Archive traces (job execution)
- the number of daily unique users (the other two applications)
- the monthly performance availability/variability
Application-service mapping:
- Job Execution: GAE Run
- Selling Virtual Goods: AWS FPS
- Game Status Maintenance: AWS SDB + GAE Datastore
Experimental Setup (2): Metrics
For Job Execution:
- average response time and average bounded slowdown
- cost in millions of consumed CPU hours
For the other two applications: the Aggregate Performance Penalty, APR(t), with
- P(t): the performance at time t (in some metric)
- Pref: the reference performance, the average of the twelve monthly medians
- U(t): the number of users at time t
- max U(t): the maximum number of users over the whole trace
The lower the APR, the better (it reflects how many users are hit).
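The slide lists only the ingredients of the APR, so the exact formula below is a plausible reconstruction, not the paper's definitive definition: the per-sample penalty P(t)/Pref is weighted by the fraction of the maximum user population that is active at t, and the weighted penalties are then averaged.

```python
# Hypothetical reconstruction of the Aggregate Performance Penalty:
# penalties are worse when more users are affected; a value near 1 means
# users saw roughly the reference performance. Lower is better.

def aggregate_performance_penalty(perf, users, p_ref):
    """perf and users are equal-length time series; p_ref is the
    reference performance (here, P is a latency-like metric)."""
    max_users = max(users)
    penalties = [(u / max_users) * (p / p_ref)
                 for p, u in zip(perf, users)]
    return sum(penalties) / len(penalties)
```

Under this reading, the Datastore result on the next slides (APR ~1, i.e., no penalty) corresponds to performance hovering around the reference while most users are active.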
Game Status Maintenance (1): Scenario
- Maintenance of the game status for a large-scale social game such as Farm Town, which has millions of unique users daily
- For such games, status maintenance is done with simple database operations, on AWS SDB or the GAE Datastore
- Trace: the number of users of Farm Town
- We assume that the number of database operations depends linearly on the number of daily unique users
Game Status Maintenance (2): Results
Big discrepancy between the SDB and Datastore services:
- Sep '09 - Jan '10: the APR of the Datastore is well below that of SDB, due to the increasing performance of the Datastore
- APR of the Datastore ~1 => no performance penalty
- APR of SDB ~1.4 => a 40% larger performance penalty than the Datastore
Other issues in cloud performance
- How to efficiently use elasticity?
- How to use virtual machine migration?
- How to insulate the VMs of different users from each other?
- How to meet SLAs (e.g., using predictions)?
- How to benefit from data locality (same unit/rack/data center)?
The Archives: Grid Workloads
Motivation: little is known about real grid use; no grid workloads were available (except "my grid"), and there was no standard way to share them.
The Grid Workloads Archive facilitates sharing grid workload traces and associated research:
- understand how real grids are used
- address challenges facing grid resource management
- develop and test solutions (simulations, experiments)
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, and D.H.J. Epema, "The Grid Workloads Archive," Future Generation Computer Systems, Vol. 24, pp. 672-686, 2008.