Grid and Cloud Scheduling and Resource Management


Grid and Cloud Scheduling and Resource Management
Dick Epema
Parallel and Distributed Systems Group, Delft University of Technology

Outline
- Introduction
- Resource Management in Grids versus Clouds
- Condor Flocking in Grids
- Processor Co-allocation
- Cycle Scavenging
- Performance Variability in Clouds

Grids and Clouds
- Grids: large, multi-organizational collections of heterogeneous machines (desktops, clusters, supercomputers); best-effort service, no cost model
- Clouds: large, uni-organizational, homogeneous data centers; weak SLAs, a cost model

Grids versus Clouds

    Grids                        | Clouds
    -----------------------------|------------------------------
    heterogeneous                | homogeneous
    many types of systems        | datacenters
    multi-organizational         | clearly defined organization
    scientific applications      | web/... applications
    authentication issues        | authentication by credit card
    one user - many resources    | elasticity
    no industry involvement      | much industry involvement
    free (?)                     | payment model
                                 | energy awareness

Job types in Grids and Clouds
- Sequential jobs / bags-of-tasks: e.g., parameter sweep applications (PSAs); these account for the vast majority of jobs in grids; leads to high-throughput computing
- Parallel jobs: an extension of high-performance computing; in most grids, only a small fraction
- Workflows: multi-stage data filtering, etc.; represented by directed (acyclic) graphs
- MapReduce applications (in clouds)
- Miscellaneous: interactive simulations; data-intensive applications (also in clouds); web applications (in clouds)

A General Resource Management Framework (1)
[figure: a general resource management framework]

A General Resource Management Framework (2)
- Local resource management system (Resource Level): the basic resource management unit; provides low-level resource allocation and scheduling; e.g., PBS, SGE, LSF
- Global resource management system (Grid Level): coordinates the local resource management systems within multiple or distributed sites; provides high-level functionalities to use resources efficiently:
  - job submission
  - resource discovery and selection
  - authentication, authorization, accounting
  - scheduling
  - job monitoring
  - fault tolerance

How to select resources in the grid?
- Grid scheduling is the process of assigning jobs to grid resources.
- A grid resource broker/scheduler dispatches jobs to the local resource managers, which in turn manage clusters (space-shared allocation) or single CPUs (time-shared allocation).
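As an illustration of this two-level structure, here is a minimal Python sketch (all names are hypothetical, not taken from any real middleware): a grid broker that owns no resources itself and simply forwards each job to the local resource manager with the most idle capacity.

    from dataclasses import dataclass, field

    @dataclass
    class LocalResourceManager:
        """A cluster-level scheduler (e.g., PBS/SGE/LSF) as seen from the broker."""
        name: str
        total_cpus: int
        queue: list = field(default_factory=list)

        def idle_cpus(self) -> int:
            return self.total_cpus - sum(job["cpus"] for job in self.queue)

        def submit(self, job: dict) -> None:
            self.queue.append(job)

    class GridBroker:
        """Grid-level scheduler: owns no resources, only selects a site per job."""
        def __init__(self, sites):
            self.sites = sites

        def schedule(self, job: dict) -> str:
            # Consider only the sites that can currently hold the job.
            candidates = [s for s in self.sites if s.idle_cpus() >= job["cpus"]]
            if not candidates:
                raise RuntimeError("no site can currently run this job")
            best = max(candidates, key=lambda s: s.idle_cpus())
            best.submit(job)
            return best.name

    sites = [LocalResourceManager("cluster-A", 64), LocalResourceManager("cluster-B", 32)]
    broker = GridBroker(sites)
    print(broker.schedule({"name": "psa-task-1", "cpus": 16}))  # -> cluster-A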

Resource Characteristics in Grids (1)
- Autonomous: each resource has its own management policy or scheduling mechanism; no central control; multi-organizational sets of resources
- Heterogeneous: hardware (processor architectures, disks, network); basic software (OS, libraries); grid software (middleware); systems management (security set-up, runtime limits)

Resource Characteristics in Grids (2)
- Size: large numbers of nodes, providers, and consumers; large amounts of data
- Varying availability: resources can join or leave the grid at any time, due to maintenance, policy reasons, and failures
- Insecure and unreliable environment: prone to various types of attacks

Problems in Grid Scheduling (1)
1. Grid schedulers do not own resources themselves: they have to negotiate with autonomous local schedulers; authentication/multi-organizational issues
2. Grid schedulers have to interface to different local schedulers: some may have support for reservations, others are queuing-based; some may support checkpointing, migration, etc.
3. Structure of applications: many different structures (parallel, PSAs, workflows, database, etc.); need for application adaptation modules

Problems in Grid Scheduling (2)
4. Lack of a reservation mechanism: but with such a mechanism, we need good runtime estimates
5. Heterogeneity: see above
6. Failures: monitor the progress of applications and the sanity of systems; the only known remedy upon a failure is to restart (possibly from a checkpoint)
7. Performance metric: turn-around time, cost
8. Reproducibility of performance experiments

Example Grids: The Globus Toolkit
- Widely used grid middleware to negotiate and control access to resources; the GRAM component controls local execution
- The Globus Toolkit includes software services and libraries for: resource monitoring; a resource specification language; resource discovery and management; security; file management (GridFTP)
- Was the dominant grid middleware in the period 1997-2006
- Main designers: Ian Foster (Chicago) and Carl Kesselman (LA)
- Achievement award talk by Ian Foster at HPDC'12 in Delft, online at www.hpdc.org/2012
- www.globus.org

Condor (1/8): batch queuing system + grids
- Condor is a high-throughput scheduling system
- Started around 1986 as one of many batch queuing systems for clusters (of desktop machines), and has survived!
- Supports cycle scavenging: using idle time on clusters of machines
- Introduced the notions of matchmaking and classads
- Provides remote system calls, a queuing mechanism, scheduling policies, a priority scheme, and resource monitoring
- Initiated and still being developed by Miron Livny, Madison, Wisconsin
- http://research.cs.wisc.edu/condor
- D.H.J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Future Generation Computer Systems, Vol. 12, pp. 53-65, 1996.

Condor (2/8): matchmaking
Basic operation of Condor:
1a. jobs send classads to the matchmaker
1b. machines send classads to the matchmaker
1c. the matchmaker matches jobs and machines
1d. ... and notifies the submission machine
2a. ... which starts a shadow process that represents the remote job on the execution machine
2b/c. ... and contacts the execution machine
3b/c. on the execution machine, the actual remote user job is started
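A toy Python sketch of classad-style matchmaking (steps 1a-1d above); real Condor classads are far richer, with two-sided Requirements and Rank expressions, and all names here are made up:

    # Toy classad matchmaking: both sides advertise attributes plus a
    # requirement predicate; the matchmaker pairs mutually compatible ads.
    job_ads = [
        {"owner": "alice", "memory_mb": 2048,                        # 1a
         "requires": lambda m: m["memory_mb"] >= 2048 and m["arch"] == "x86_64"},
    ]
    machine_ads = [
        {"name": "node17", "arch": "x86_64", "memory_mb": 4096,      # 1b
         "state": "idle",
         "requires": lambda j: j["memory_mb"] <= 4096},
    ]

    def matchmake(jobs, machines):
        """Return (job, machine) pairs for which both requirements hold (1c)."""
        matches = []
        for job in jobs:
            for machine in machines:
                if (machine["state"] == "idle"
                        and job["requires"](machine)
                        and machine["requires"](job)):
                    machine["state"] = "claimed"   # 1d: notify and claim
                    matches.append((job, machine))
                    break
        return matches

    for job, machine in matchmake(job_ads, machine_ads):
        print(f"job of {job['owner']} matched to {machine['name']}")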

Condor (3/8): combining pools (design 1)
Federation with a Flock Central Manager (FCM).
Disadvantages:
- who maintains the FCM?
- the FCM is a single point of failure
- the matchmakers have to be modified

Condor (4/8): combining pools (design 2)
A protocol between the Central Managers.
Disadvantage: the matchmaker has to be modified

Condor (5/8): combining pools (design 3)
Connecting pools network-style with GateWays (GWs): Condor flocking.
Advantages:
- transparent to the matchmaker
- no component has to be maintained by a third party
- in the GWs, any access policies and resource sharing policies can be implemented

Condor (6/8): flocking
[figure: a submission pool and an execution pool, each with its own matchmaker (MM), connected by GateWays; the GateWay presents itself as a machine of the other pool, and presents the remote job on the execution machine as if it were its own]
Conclusions:
- The design considerations for Condor flocking are still very valid when joining systems (cloud federations?)
- A nice, clear, transparent research solution that was too complex in practice

Condor (7/8): user flocking
Conclusion: Condor and Condor flocking have survived 20 years of grid computing!

Experimentation: DAS-4
- A system purely for CS research, operational since October 2010
- Six sites connected by SURFnet6 10 Gb/s lambdas: VU (148 CPUs), UvA/MultimediaN (72), UvA (32), TU Delft (64), Leiden (32), Astron (46)
- Specs: 1,600 cores (2.4 GHz quad-core CPUs), accelerators (GPUs), 180 TB storage, 10 Gb/s InfiniBand, Gb Ethernet

KOALA: a co-allocating grid scheduler
Original goals:
1. processor co-allocation: parallel applications
2. data co-allocation: job affinity based on data locations
3. load sharing: in the absence of co-allocation, while being transparent to the local schedulers
Additional goals:
- research vehicle for grid and cloud research
- support for (other) popular application types
KOALA is written in Java, is middleware-independent (initially Globus-based), and has been deployed on the DAS-2 through the DAS-4 since September 2005.

KOALA: the runners
The KOALA runners are adaptation modules for different application types:
- set up the communication / name server / environment
- launch applications and perform application-level scheduling (scheduling policies)
Current runners:
- CSRunner: for cycle-scavenging applications (PSAs)
- IRunner: for applications using the Ibis Java library
- MRunner: for malleable parallel applications
- OMRunner: for co-allocated parallel OpenMPI applications
- WRunner: for co-allocated workflows
- MR-Runner: for MapReduce applications (under construction)

Co-Allocation (1)
In grids, jobs may use resources in multiple sites: co-allocation or multi-site operation.
Reasons:
- to benefit from the available resources (e.g., processors)
- to access and/or process geographically spread data
- application characteristics (e.g., simulation in one location, visualization in another)
Resource possession in different sites can be:
- simultaneous (e.g., parallel applications)
- coordinated (e.g., workflows)
With co-allocation:
- a more difficult resource-discovery process
- a single global job
- the need to coordinate allocations by autonomous resource managers

Co-allocation (2): job types
- Fixed job: the placement of the job components is fixed
- Non-fixed job: the scheduler decides on the placement of the components
- Flexible job: same total job size, but the scheduler decides on the split-up into components and their placement
- Total job: a single cluster of the same total size (no co-allocation)

Co-allocation (3): slowdown
- Co-allocated applications are less efficient due to the relatively slow wide-area communications.
- Slowdown of a job:
    slowdown = (execution time on a multicluster) / (execution time on a single cluster), usually > 1
- Processor co-allocation is a trade-off between faster access to more capacity and higher utilization on the one hand, and shorter execution times on the other.
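A small numeric sketch of this trade-off (illustrative numbers only, not from any measurement): co-allocation pays off when the extra execution time due to wide-area communication is outweighed by the shorter wait for enough free processors.

    def slowdown(t_multi: float, t_single: float) -> float:
        """Slowdown of a co-allocated job (usually > 1)."""
        return t_multi / t_single

    def co_allocation_wins(wait_single, t_single, wait_multi, t_multi):
        """Compare response times (wait + execution) with and without co-allocation."""
        return wait_multi + t_multi < wait_single + t_single

    t_single, t_multi = 100.0, 125.0           # execution times (s); slowdown = 1.25
    print(slowdown(t_multi, t_single))          # 1.25
    # A long single-cluster queue can still make co-allocation the better choice:
    print(co_allocation_wins(wait_single=60.0, t_single=t_single,
                             wait_multi=5.0, t_multi=t_multi))    # True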

Co-allocation (4): scheduling policies
Placement policies dictate where the components of a job go.
Placement policies for non-fixed jobs:
1. Load-aware: Worst Fit (WF), to balance the load across the clusters
2. Input-file-location-aware: Close-to-Files (CF), to reduce file-transfer times
3. Communication-aware: Cluster Minimization (CM), to reduce the number of wide-area messages
Placement policies for flexible jobs:
1. Communication-aware: Flexible Cluster Minimization (FCM), i.e., CM for flexible jobs
2. Network-aware: Communication-Aware (CA), taking latency into account
(A sketch of WF follows.)
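A sketch of the load-aware Worst Fit policy for a non-fixed job (a simplification; the cluster names and sizes are made up): every component is placed on the cluster that currently has the most idle processors.

    def worst_fit(components, clusters):
        """Place each job component on the cluster with the most idle processors.

        components: list of component sizes (numbers of processors)
        clusters:   dict mapping cluster name -> idle processors
        Returns a placement dict, or None if some component does not fit.
        """
        idle = dict(clusters)                  # work on a copy
        placement = {}
        for i, size in enumerate(sorted(components, reverse=True)):
            name = max(idle, key=idle.get)     # "worst fit": the emptiest cluster
            if idle[name] < size:
                return None                    # the job has to wait
            placement[f"component-{i}"] = name
            idle[name] -= size
        return placement

    print(worst_fit([8, 8, 4], {"delft": 16, "leiden": 10, "vu": 20}))
    # {'component-0': 'vu', 'component-1': 'delft', 'component-2': 'vu'}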

Co-allocation (5): simulations/analysis
The model has a host of parameters. Main conclusions:
- Co-allocation is beneficial when the slowdown is at most about 1.20
- Unlimited co-allocation is not beneficial: limit the number of job components, and limit the maximum job-component size
- Give local jobs some, but not absolute, priority over global jobs
- Mathematical analysis for the maximal utilization
See, e.g.:
1. A.I.D. Bucur and D.H.J. Epema, "Trace-Based Simulations of Processor Co-Allocation Policies in Multiclusters," IEEE/ACM High Performance Distributed Computing (HPDC), 2003.
2. A.I.D. Bucur and D.H.J. Epema, "Scheduling Policies for Processor Co-Allocation in Multicluster Systems," IEEE Trans. on Parallel and Distributed Systems, Vol. 18, pp. 958-972, 2007.

Co-Allocation (6): Experiments on the DAS-3
[figure: average job execution and response times versus the number of clusters combined, for the FCM and CA policies, with and without the poorly connected Delft cluster]
O.O. Sonmez, H.H. Mohamed, and D.H.J. Epema, "On the Benefit of Processor Co-Allocation in Multicluster Grid Systems," IEEE Trans. on Parallel and Distributed Systems, Vol. 21, pp. 778-789, 2010.

Cycle Scavenging (1): System Requirements
1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
2. Fairness: multiple cycle-scavenging applications running concurrently should be assigned comparable CPU time
3. Dynamic resource allocation: cycle-scavenging applications have to grow/shrink at runtime
4. Efficiency: as much use of the dynamic resources as possible
5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with

Cycle Scavenging (2): Policies
1. Equipartition-All: divide the idle resources equally over the CS users on a grid-wide basis
2. Equipartition-PerSite: divide the idle resources of each cluster equally over the CS users on a per-cluster basis
[figures: three CS users and the non-CS jobs sharing the clusters under each policy]
(A sketch of both policies follows.)
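A minimal sketch of the two policies, under the simplifying assumption that "equipartition" means dividing the idle nodes evenly over the CS users (grid-wide versus per cluster); remainders go to the first users:

    def equipartition(idle: int, users: list[str]) -> dict[str, int]:
        """Divide idle nodes evenly; the remainder goes to the first users."""
        share, extra = divmod(idle, len(users))
        return {u: share + (1 if i < extra else 0) for i, u in enumerate(users)}

    def equi_all(clusters: dict[str, int], users: list[str]) -> dict[str, int]:
        """Equipartition-All: one grid-wide division of all idle nodes."""
        return equipartition(sum(clusters.values()), users)

    def equi_per_site(clusters, users):
        """Equipartition-PerSite: a separate division within every cluster."""
        return {site: equipartition(idle, users) for site, idle in clusters.items()}

    clusters = {"delft": 10, "vu": 7}
    users = ["cs-user-1", "cs-user-2", "cs-user-3"]
    print(equi_all(clusters, users))      # {'cs-user-1': 6, 'cs-user-2': 6, 'cs-user-3': 5}
    print(equi_per_site(clusters, users))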

Cycle Scavenging (3): Experimental Results
[figure: the number of completed tasks for Equi-All and Equi-PerSite, under the WBlock and WBurst workloads]
Equi-PerSite is fair, and superior to Equi-All.
O.O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup, and D.H.J. Epema, "Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems," 9th IEEE/ACM Int'l Symposium on Cluster Computing and the Grid (CCGRID09), May 2009.

Research Questions
A. Iosup, M.N. Yigitbasi, and D.H.J. Epema, "On the Performance Variability of Production Cloud Services," 11th IEEE/ACM Int'l Symp. on Cluster, Cloud, and Grid Computing (CCGrid), pp. 104-113, 2011.

Outline
1. Introduction and Motivation
2. Method
3. Q1: Analysis of performance variability
   1. Analysis of the AWS Dataset
   2. Analysis of the GAE Dataset
4. Q2: Impact of variability on large-scale applications
   1. Grid & PPE Job Execution
   2. Selling Virtual Goods in Social Networks
   3. Game Status Maintenance for Social Games

Production Cloud Services
- Amazon Web Services (AWS), IaaS:
  - EC2 (Elastic Compute Cloud)
  - S3 (Simple Storage Service)
  - SQS (Simple Queueing Service)
  - SDB (Simple Database)
  - FPS (Flexible Payment Service)
- Google App Engine (GAE), PaaS:
  - Run (Python/Java runtime)
  - Datastore (database)
  - Memcache (caching)
  - URL Fetch (web crawling)

Our Method (1): Performance Traces
- CloudStatus (www.cloudstatus.com, now defunct): real-time values and weekly averages for most of the AWS and GAE services
- Values obtained with periodic performance probes (sampling rate under 2 minutes)
- Data obtained from the CloudStatus website using web crawls

Our Method (2): Analysis
1. Data selection: select several months' worth of data to see whether a performance metric is highly variable
2. Find the characteristics of the variability:
   - basic statistics: the five quartiles (Q0-Q4), including the median (Q2), the mean, and the standard deviation
   - derivative statistic: the IQR (Q3-Q1)
   - e.g., a coefficient of variation > 1.1 indicates high variability
3. Analyze the time patterns of the performance variability:
   - investigate for each performance metric the presence of daily/weekly/monthly/yearly time patterns
   - e.g., for monthly patterns, divide the dataset into twelve subsets, and for each subset compute the statistics and plot them for visual inspection
(A sketch of steps 2 and 3 follows.)
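A sketch of steps 2 and 3 using only the Python standard library (the 1.1 threshold is from the slide; the sample data and helper names are illustrative):

    import statistics

    def variability_stats(samples: list[float]) -> dict:
        """The five quartiles, mean, standard deviation, IQR, and CoV of a trace."""
        q1, q2, q3 = statistics.quantiles(samples, n=4)   # Q1, median (Q2), Q3
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        return {
            "min": min(samples), "Q1": q1, "median": q2, "Q3": q3,
            "max": max(samples), "IQR": q3 - q1,
            "mean": mean, "stdev": stdev,
            "CoV": stdev / mean,          # > 1.1 signals high variability
        }

    def monthly_subsets(trace: list[tuple[int, float]]) -> dict[int, list[float]]:
        """Split (month, value) samples into the twelve monthly subsets of step 3."""
        by_month: dict[int, list[float]] = {}
        for month, value in trace:
            by_month.setdefault(month, []).append(value)
        return by_month

    stats = variability_stats([57.0, 73.6, 75.7, 78.5, 122.1, 75.0, 76.0])
    print(round(stats["IQR"], 1), round(stats["CoV"], 3))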

Our Method (3): Is Variability Present?
Use statistics or visual inspection. Example statistics for EC2:

    min  | Q1   | median | Q3   | max   | IQR | mean  | std dev
    57.0 | 73.6 | 75.7   | 78.5 | 122.1 | 4.9 | 76.62 | 5.17

Outline
1. Introduction and Motivation
2. Method
3. Q1: Analysis of performance variability
   1. Analysis of the AWS Dataset
   2. Analysis of the GAE Dataset
4. Q2: Impact of variability on large-scale applications
   1. Grid & PPE Job Execution
   2. Selling Virtual Goods in Social Networks
   3. Game Status Maintenance for Social Games

AWS Dataset (1): EC2
- Deployment latency [s]: the time it takes to start a small instance, from startup to the time the instance is available
- Higher IQR and range in weeks 41-52: due to an increasing EC2 user base?
- This has an impact on applications that use EC2 auto-scaling

AWS Dataset (2): S3
- GET throughput [bytes/s]: the estimated rate at which an object in a bucket is read
- The last five months of the year exhibit much lower IQR and range: more stable performance, possibly due to software/infrastructure upgrades

AWS Dataset (3): SQS
- Average lag time [s]: the time it takes for a posted message to become available to read (average taken over multiple queues)
- Long periods of stability (low IQR and range), but periods of high performance variability also exist

GAE Dataset (1): Run Service
- Fibonacci [ms]: the time it takes to calculate the 27th Fibonacci number
- Highly variable performance until September; the last three months have stable performance (low IQR and range)

GAE Dataset (2): Datastore
- Read latency [s]: the time it takes to read a User Group
- A yearly pattern from January to August?
- The last four months of the year exhibit much lower IQR and range

GAE Dataset (3): Memcache
- PUT [ms]: the time it takes to put 1 MB of data into Memcache
- The median performance per month shows an increasing trend over the first 10 months
- The last three months of the year exhibit stable performance

Outline
1. Introduction and Motivation
2. Background
3. Q1: Analysis of performance variability
   1. Analysis of the AWS Dataset
   2. Analysis of the GAE Dataset
4. Q2: Impact of variability on large-scale applications
   1. Grid & PPE Job Execution
   2. Selling Virtual Goods in Social Networks
   3. Game Status Maintenance for Social Games

Experimental Setup (1): Simulations
Trace-based simulations for three applications. Input:
- Grid Workloads Archive traces (job execution)
- the number of daily unique users (the other two applications)
- monthly performance availability/variability

    Application             | Service
    ------------------------|--------------------------
    Job Execution           | GAE Run
    Selling Virtual Goods   | AWS FPS
    Game Status Maintenance | AWS SDB + GAE Datastore

Experimental Setup (2): Metrics
- For Job Execution: the average response time and the average bounded slowdown; the cost in millions of consumed CPU hours
- For the other two applications: the Aggregate Performance Penalty, APR(t), defined in terms of:
  - P(t): the performance at time t (in some metric)
  - Pref: the reference performance, the average of the twelve monthly medians
  - U(t): the number of users at time t
  - max U(t): the maximum number of users over the whole trace
- The lower the APR, the better (how many users are hit?)
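The exact formula did not survive the transcription, so the sketch below assumes the natural reading of the definitions above: the performance ratio P(t)/Pref weighted by the fraction of users affected, U(t)/max U(t). Treat it as an illustration, not as the paper's definition.

    import statistics

    def apr(performance: list[float], users: list[int],
            monthly_medians: list[float]) -> list[float]:
        """Assumed form: APR(t) = (P(t) / Pref) * (U(t) / max U(t)).

        Pref is the average of the twelve monthly medians; lower is better.
        """
        p_ref = statistics.mean(monthly_medians)
        max_users = max(users)
        return [(p / p_ref) * (u / max_users)
                for p, u in zip(performance, users)]

    # Toy trace: latency samples, daily unique users, twelve monthly medians.
    latency = [75.0, 80.0, 150.0]
    daily_users = [900, 1000, 1000]
    medians = [75.7] * 12
    print([round(x, 2) for x in apr(latency, daily_users, medians)])
    # [0.89, 1.06, 1.98]: the spike at t=2 hits all users and doubles the penalty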

Game Status Maintenance (1): Scenario
- Maintenance of the game status for a large-scale social game such as Farm Town, which has millions of unique users daily
- For such games, status maintenance is done with simple database operations
- Trace: the number of users of Farm Town; services: AWS SDB and GAE Datastore
- We assume that the number of database operations depends linearly on the number of daily unique users

Game Status Maintenance (2): Results
[figure: APR over time for AWS SDB and GAE Datastore]
- Big discrepancy between the SDB and Datastore services
- Sep '09 - Jan '10: the APR of Datastore is well below that of SDB, due to the increasing performance of Datastore
- APR of Datastore ~1: no performance penalty
- APR of SDB ~1.4: a 40% larger performance penalty than Datastore

Other issues in cloud performance
- How to use elasticity efficiently?
- How to use virtual machine migration?
- How to insulate the VMs of different users from each other?
- How to meet SLAs (e.g., by using predictions)?
- How to benefit from data locality (same unit/rack/data center)?

The Archives: Grid Workloads
Motivation: little is known about real grid use:
- no grid workloads available (except "my grid")
- no standard way to share them
The Grid Workloads Archive facilitates sharing grid workload traces and the associated research:
- understand how real grids are used
- address the challenges facing grid resource management
- develop and test solutions (simulations, experiments)
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, and D.H.J. Epema, "The Grid Workloads Archive," Future Generation Computer Systems, Vol. 24, pp. 672-686, 2008.