A science-gateway workload archive: application to the self-healing of workflow incidents




A science-gateway workload archive: application to the self-healing of workflow incidents

Rafael FERREIRA DA SILVA, Tristan GLATARD (University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France)
Frédéric DESPREZ (INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France)

Journées Scientifiques Mésocentres et France Grilles, October 1st-3rd 2012

Context: Workload Archives

Information produced by grid workflow executions (exit_code, task_status, submit_time, site_name, execution_time, input_file, activity_name, workflow_id) is useful for:
- Assumptions validation
- Computational activity modeling
- Methods evaluation (simulation or experimental)

Science-gateway architecture

0. The user logs in to the Web Portal
1. The user sends input data
2. The portal transfers input files to the Storage Element
3. The portal launches the workflow on the Workflow Engine
4. The engine generates and submits tasks
5. The Pilot Manager submits pilot jobs to the Meta-Scheduler
6. The Meta-Scheduler schedules pilot jobs on a computing site
7. A pilot gets a task
8. The pilot gets the input files
9. The pilot executes the task
10. The pilot uploads the results

State of the Art: Grid Workload Archives

- Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/)
- Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)

Information is gathered at infrastructure level and limited to tasks (exit_code, task_status, submit_time, site_name, execution_time). Lack of critical information:
- Dependencies among tasks
- Task sub-steps
- Application-level scheduling artifacts
- User

At infrastructure level

(Same architecture diagram as before, highlighting the components visible to infrastructure-level archives: the meta-scheduler, pilot jobs, and computing sites.)

Outline

- A science-gateway workload archive
- Case studies: pilot jobs, accounting, task analysis, bag of tasks
- Workflow self-healing
- Conclusions

Our approach: Science-Gateway Workload Archive

Information is gathered at science-gateway level (exit_code, task_status, submit_time, site_name, execution_time, activity_name, input_file, workflow_id). Advantages:
- Fine-grained information about tasks
- Dependencies among tasks
- Workflow characterization
- Accounting of workflow executions

At science-gateway level

(Same architecture diagram, highlighting the science-gateway components where the archive collects its information: the web portal, workflow engine, and pilot manager.)

Virtual Imaging Platform

- Virtual Imaging Platform (VIP): a medical-imaging science gateway, http://vip.creatis.insa-lyon.fr
- Runs on a grid of 129 sites (EGI, http://www.egi.eu)
- Significant usage in 2011:
  - Registered users: 244, from 26 countries
  - Applications: 18
  - 32 CPU years consumed

(Figure: CPU consumption of VIP and related platforms on EGI in 2011.)

SGWA: Science-Gateway Workload Archive

- The archive is extracted from VIP
- Science-gateway archive model: Task, Site, and Workflow Execution are acquired from databases populated by the workflow engine at runtime
- File and Pilot Job information is extracted by parsing task standard output and error files

Workload for Case Studies

Based on the workload of VIP from January 2011 to April 2012:
- 112 users
- 2,941 workflow executions
- 680,988 tasks:
  - 338,989 completed
  - 138,480 error
  - 105,488 aborted
  - 15,576 aborted replicas
  - 48,293 stalled
  - 34,162 queued
- 339,545 pilot jobs

Pilot Jobs

A single pilot can wrap several tasks and users.
- At infrastructure level, pilot jobs are assimilated to tasks and users; this is valid for only 62% of the tasks and 95% of user-task associations
- At science-gateway level, users and tasks are correctly associated to pilots

(Figures: histograms of tasks per pilot — 282,331 pilots ran 1 task, 28,121 ran 2, 11,885 ran 3, 6,721 ran 4, 10,487 ran 5 — and of users per pilot — 323,214 pilots served 1 user, 15,178 served 2, 1,079 served 3, 70 served 4, 4 served 5.)

Accounting: Users

Authentications based on login and password are mapped to X.509 robot certificates.
- At infrastructure level, all VIP users are reported as a single user
- At science-gateway level, task executions are mapped to individual VIP users

(Figure: number of reported EGI and VIP users over 16 months.)

Accounting: CPU and Wall-clock Time

Huge discrepancy between infrastructure-level (EGI) and science-gateway (VIP) values, undetectable at infrastructure level. Causes:
- Pilot jobs that do not register to the pilot system
- Absence of workload
- Unretrievable outputs
- Pilot setup time
- Lost tasks (a.k.a. stalled)

(Figures: number of pilot jobs submitted, and CPU and wall-clock time consumed, as reported by EGI and by VIP per month.)

Task Analysis

- At infrastructure level: limited to task exit codes
- At science-gateway level: fine-grained information on the steps in a task's life, error causes, and replicas per task

(Figures: error causes — application: 55,165; input: 50,925; stalled: 48,293; output: 19,463; folder: 1,123 — and CDF of the durations of the download, execution, and upload steps of task life.)

Bag of Tasks: at Infrastructure Level

Evaluation of the accuracy of the method of Iosup et al. [8] to detect bags of tasks (BoT): two successively submitted tasks are in the same BoT if the time interval between their submission times is lower than or equal to a threshold Δ.

(Diagram: tasks 1-3 submitted at times t1, t2, t3. If t2 - t1 ≤ Δ and t3 - t2 ≤ Δ, all three tasks form one BoT; if t3 - t2 > Δ, task 3 starts a new BoT.)

[8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The characteristics and performance of groups of jobs in grids. In: Euro-Par (2007) 382-393
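The Δ-based detection above can be sketched in a few lines; the function name and the Δ value are illustrative, not taken from [8].

```python
DELTA = 120  # threshold in seconds (illustrative value)

def group_bags_of_tasks(submit_times, delta=DELTA):
    """Group submission timestamps into bags of tasks: a new bag starts
    whenever the gap to the previous submission exceeds delta."""
    bags = []
    current = []
    for t in sorted(submit_times):
        if current and t - current[-1] > delta:
            bags.append(current)
            current = []
        current.append(t)
    if current:
        bags.append(current)
    return bags

# Example: 10s gaps stay in one bag, a 500s gap opens a new one.
print(group_bags_of_tasks([0, 10, 20, 520, 530]))
# [[0, 10, 20], [520, 530]]
```

As the case study shows, the quality of the result depends entirely on the choice of Δ relative to the real submission pattern.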

Bag of Tasks: Size and Duration

Infrastructure-level vs. science-gateway detection:
- Size: sizes from 2 to 10 cover 90% of detected (Batch) BoTs, but only 50% of ground-truth (Real Batch) BoTs
- Duration: Non-Batch duration is overestimated by up to 400%

(Figures: CDFs of BoT size and duration. Real Batch = ground-truth BoT; Real Non-Batch = ground-truth non-BoT; Batch = Iosup et al. BoT; Non-Batch = Iosup et al. non-BoT.)

Bag of Tasks: Inter-arrival Time and Consumed CPU Time

- Batch and Non-Batch inter-arrival times are underestimated by about 30%
- CPU times are underestimated by 25% for Non-Batch and by about 20% for Batch

(Figures: CDFs of inter-arrival time (s) and consumed CPU time (KCPUs). Real Batch = ground-truth BoT; Real Non-Batch = ground-truth non-BoT; Batch = Iosup et al. BoT; Non-Batch = Iosup et al. non-BoT.)


Workflow Self-Healing

Problem: costly manual operations (rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files).

Objective: automated platform administration
- Autonomous detection of operational incidents
- Execution of an appropriate set of actions

Assumptions: online and non-clairvoyant
- Only partial information is available
- Decisions must be fast
- Production conditions: no prediction of user activity or workloads

General MAPE-K Loop

- Monitoring: triggered by an event (job completion or failure) or a timeout
- Analysis: a degree is computed for each incident (e.g. incident 1: 0.8, incident 2: 0.4, incident 3: 0.1), normalized (η_i / Σ_j η_j), and mapped to one of several levels per incident
- Planning: the set of actions is chosen by roulette-wheel selection, where each (incident, level) rule is weighted by the incident degree and the confidence of association rules mined from the platform logs
- Execution applies the actions; Knowledge closes the loop

(Figure: roulette-wheel selection based on association rules, with a histogram of median-based estimations used to derive the degree.)
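The roulette-wheel step of the planning phase can be sketched as below; the incident degree and the per-level rule confidences are illustrative values, not the ones mined from the VIP logs.

```python
import random

def roulette_select(weights, rng=random):
    """Pick a key with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng.uniform(0, total)
    acc = 0.0
    for key, w in weights.items():
        acc += w
        if r <= acc:
            return key
    return key  # numerical safety net for floating-point edge cases

# Example: incident at degree 0.8, association-rule confidences per
# level (assumed); the weight of a level is degree * confidence.
degree = 0.8
confidences = {"level 1": 1.0, "level 2": 0.4, "level 3": 0.1}
weights = {lvl: degree * c for lvl, c in confidences.items()}
print(roulette_select(weights))
```

Higher-confidence rules are thus selected more often, while lower-confidence ones still get an occasional chance, which matches the non-clairvoyant setting described above.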

Incident: Activity Blocked

An invocation is late compared to the others. Possible causes:
- Longer waiting times
- Lost tasks (e.g. killed by the site due to quota violation)
- Resources with poor performance

(Figures: invocation completion rate and job flow for a FIELD-II/pasa simulation, workflow-9sienv.)

Activity Blocked: Degree

The degree is computed from all completed jobs of the activity. Job phases: setup, inputs download, execution, outputs upload. Assumption: bag of tasks (all jobs have equal durations).

Median-based estimation for a running job (worked example):
- Phase medians: 50s, 250s, 20s, 15s
- Completed phases keep their real durations: 42s, 300s
- The current phase uses max(elapsed, median) = max(400s, 20s) = 400s
- Unstarted phases use the median: 15s
- Median-based duration: M_i = 50 + 250 + 400 + 15 = 715s
- Estimated job duration: E_i = 42 + 300 + 400 + 15 = 757s

Incident degree (job performance w.r.t. the median): d = E_i / (M_i + E_i), with d in [0,1].
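The estimation above can be written out as a small helper; the function name is hypothetical and the numbers reproduce the slide's example.

```python
def estimate_durations(medians, reals, elapsed_current):
    """Return (M_i, E_i) for a running job.

    medians: per-phase median durations; reals: real durations of the
    phases already completed; elapsed_current: time spent so far in the
    current phase. Completed phases use real durations in E_i and
    medians in M_i; the current phase uses max(elapsed, median) and
    unstarted phases use their medians in both sums.
    """
    k = len(reals)  # number of completed phases
    current = max(elapsed_current, medians[k])
    future = sum(medians[k + 1:])
    m_i = sum(medians[:k]) + current + future
    e_i = sum(reals) + current + future
    return m_i, e_i

medians = [50, 250, 20, 15]   # setup, inputs, execution, outputs (s)
reals = [42, 300]             # completed phases of the running job
m_i, e_i = estimate_durations(medians, reals, elapsed_current=400)
degree = e_i / (m_i + e_i)
print(m_i, e_i, round(degree, 2))  # 715 757 0.51
```

A job exactly at the median gets d = 0.5, and d grows toward 1 as the job falls further behind, which is what the level thresholds on the next slide act on.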

Activity Blocked: Levels and Actions

Levels are identified from a threshold τ1 on the histogram of degrees observed in the platform logs: level 1 triggers no action; level 2 triggers job replication.

Actions:
- Job replication
- Cancel replicas with bad performance
- Replicate only if all active replicas are running

(Figure: histogram of median-based estimations, and the replication process for one task.)

Experiments

Goal: compare Self-Healing against No-Healing in coping with recoverable errors.

Metrics:
- Makespan of the activity execution
- Resource waste: w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1
  - w < 0: self-healing consumed fewer resources
  - w > 0: self-healing wasted resources
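The waste metric above is a simple ratio; a minimal helper, with illustrative input values:

```python
def resource_waste(cpu_sh, data_sh, cpu_nh, data_nh):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.
    w < 0 means self-healing consumed fewer resources; w > 0 means it
    wasted resources."""
    return (cpu_sh + data_sh) / (cpu_nh + data_nh) - 1

# Example (made-up units): self-healing used 740 resource units
# against 1000 for no-healing, i.e. a 26% reduction.
print(round(resource_waste(600, 140, 850, 150), 2))  # -0.26
```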

Experiment Conditions

Software:
- Virtual Imaging Platform
- MOTEUR workflow engine
- DIRAC pilot-job system

Infrastructure:
- European Grid Infrastructure (EGI): production, shared
- Self-Healing and No-Healing launched simultaneously

Experiment parameters:
- Task and file replication limited to 5
- Failed task resubmission limited to 5

Applications

FIELD-II/pasa:
- Ultrasound imaging simulation
- 122 invocations
- CPU time: 15 min; ~210 MB
- Data-intensive
(Image courtesy of ANR project US-Tagging, http://www.creatis.insa-lyon.fr/us-tagging/news, O. Bernard, M. Alessandrini)

Mean-Shift/hs3:
- Image denoising
- 250 invocations
- CPU time: 1 hour; ~182 MB
- CPU-intensive
(Image courtesy of Ting Li, http://www.creatis.insa-lyon.fr)

Results

The experiment tests whether recoverable errors are detected.
- FIELD-II/pasa: Self-Healing speeds up execution by up to a factor of 4
- Mean-Shift/hs3: Self-Healing speeds up execution by up to a factor of 2.6

(Figures: makespans of No-Healing vs. Self-Healing over 5 repetitions for each application.)

Resource waste w per repetition:
- FIELD-II/pasa: 0.10, 0.15, 0.09, 0.05, 0.26
- Mean-Shift/hs3: 0.02, 0.20, 0.02, 0.02, 0.01

The Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution.

Conclusions

Science-gateway model of workload archive:
- Illustrated using traces of VIP from 2011-2012
- Added value compared to infrastructure-level traces:
  - Exactly identifies tasks and users
  - Distinguishes additional workload artifacts from the real workload
  - Fine-grained information about tasks
  - Ground truth of bags of tasks

Self-healing of workflow incidents:
- Implements a generic MAPE-K loop
- Incident degrees computed online
- Speeds up execution by up to a factor of 4
- Reduces resource consumption by up to 26%
- Successful example of a self-healing loop deployed in production

VIP is openly available at http://vip.creatis.insa-lyon.fr
Traces are available to the community in the Grid Observatory: http://www.grid-observatory.org

A science-gateway workload archive: application to the self-healing of workflow incidents

Thank you for your attention. Questions?

Acknowledgments:
- VIP users and project members
- French National Agency for Research (ANR-09-COSI-03)
- European Grid Initiative (EGI)
- France-Grilles

Rafael FERREIRA DA SILVA, Tristan GLATARD (University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France)
Frédéric DESPREZ (INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France)