Slurm Fault Tolerant Workload Management




19 October 2013
David Bigagli, Morris Jette
[david,jette]@schedmd.com

Motivation
Failures in large computers are inevitable: the bigger the parallel job, the higher the probability of a failure at runtime. The goal is to implement failure recovery services in Slurm that running applications can use to respond to observed or anticipated failures.

Current approach
Workload manager based:
- Using re-runnable jobs
- Using job dependencies
- Allocating extra resources
Application based:
- System-level checkpointing
- Application-level checkpointing
If failures are common, the impact on application performance is significant.
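For context, here is a minimal sketch of the workload-manager-based options using standard sbatch features (script names and node counts are illustrative, not from the original slides):

#!/bin/sh
# Re-runnable job: Slurm may requeue it after a node failure.
jobid=$(sbatch --parsable --requeue -N 16 run_app.sh)
# Job dependency: launch a recovery/cleanup job only if the first one fails.
sbatch --dependency=afternotok:${jobid} recover_app.sh
# Extra resources: request more nodes than strictly needed so the
# application can lose some and keep going.
sbatch --no-kill -N 18 run_app.sh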

Helping applications be resilient (Dr. William Kramer, NCSA, SUG 2011)
Applications are able to reallocate work. The resource manager provides assistance to substitute resources with just-in-time delivery. A protocol between the resource manager and the application negotiates the best solution:
System: Your node x just broke.
App: Can I have another node to replace it?
System: Yes, but not for 50 minutes.
App: OK, then just drop it and extend my runtime.

Helping applications be resilient (Dr. William Kramer, NCSA, SUG 2011)
Another example:
App: My node Y is not responding.
System: I can give you another one in 5 minutes.
App: Can you make it 2 nodes so I can make up the lost time?
System: Yes, but not for 7 minutes.
App: Also adjust my time limit by 20 minutes.
System: I have something else waiting, but can give you 10 minutes.
App: OK.
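In the SnonStop interface described later in this talk, this negotiation is expressed up front through environment variables rather than interactively. A hedged sketch of how the first exchange might be encoded (the values and the DROP fallback are illustrative assumptions, not taken from the slides):

# Ask for a replacement node; if none arrives within 50 minutes,
# drop the broken node and extend the run time instead.
export SMD_NONSTOP_FAILED=REPLACE:TIME_LIMIT_DELAY=50:DROP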

Slurm failure management infrastructure
- Failed hosts: currently out of service.
- Failing hosts: malfunctioning and/or expected to fail.
- Hot spare: a cluster-wide pool of resources made available to jobs with failed/failing nodes.
- The hot spare pool is partition based; the administrator specifies how many spares are in a given partition.
- Any node in the partition can be part of the spare pool.
- An access control list on the hot spare pool indicates which users/groups may or may not use it.
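A minimal sketch of how an administrator might declare this (the parameter names, the bootes partition and the counts come from the fuller nonstop.conf example later in this document; the comments are interpretive):

# nonstop.conf fragment: spare nodes drawn from the bootes partition
HotSpareCount=bootes:2
# Per-user access to the drain/replace services
UserDrainAllow=david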

Slurm failure management infrastructure
- Failing hosts can be drained and then dropped from the allocation, giving the application the flexibility to manage its own resources.
- Drained nodes can be put back on-line by the administrator and will automatically return to the spare pool.
- Failed nodes can be put back on-line and will automatically return to the spare pool.
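The administrator-side step is ordinary Slurm node management; a sketch (the node name is borrowed from the example output later in this document):

# Return a previously drained/failed node to service; once it is back
# on-line it rejoins the hot spare pool automatically.
scontrol update NodeName=achab6 State=RESUME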

Slurm failure management infrastructure
- The application usually detects a failure by itself, by losing one or more of its components.
- It is able to notify Slurm of failures and have the nodes drained.
- The application can also query Slurm about the state of the nodes in its allocation.
- The application asks Slurm to replace its failed/failing nodes; it can then:
  - wait for nodes to become available, possibly increasing its runtime until then;
  - increase its runtime upon node replacement;
  - drop the nodes and continue, possibly increasing its runtime.

Slurm architecture
- A slurmctld plugin keeps track of the spare pool and of the job allocation status.
- libsmd.so, libsmd.a and smd.h are the client interface and library to the nonstop services, based on a jobid.
- The snonstop command, built on top of the library, provides a command line interface to the recovery services.
- The nonstop.sh shell script automates node replacement based on user-supplied environment variables.

SnonStop usage
- The application runs unchanged and at every step checks the health of its nodes.
- The application may link with the libsmd.so library and use the nonstop API to retrieve node status and take action to replace nodes.
- The application may link with the libsmd.so library and subscribe for events, which will be delivered asynchronously.
- The job has to be submitted to Slurm with the --no-kill option to prevent it from being killed upon component failure.
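For example, a job running the script from the next slide might be submitted like this (the script name and node count are illustrative; the partition name comes from the nonstop.conf example below):

# Submit the fault-tolerant job script; --no-kill keeps the allocation
# alive when a node fails so the recovery logic can run.
sbatch --no-kill -N 8 -p bootes nonstop_job.sh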

SnonStop common use case
#!/bin/sh
# Set the environment variable to handle
# runtime node failure.
export SMD_NONSTOP_FAILED=REPLACE:TIME_LIMIT_DELAY=10:EXIT_JOB

i=0
while [ $i -le 100 ]
do
    # Run the $i step of my application
    srun myapp
    # Detect failure and execute actions
    nonstop.sh
    if [ $? -ne 0 ]; then
        exit 1
    fi
    let i=i+1
done

SnonStop configuration
Environment variables determine the nonstop actions used to recover nodes. SMD_NONSTOP_FAILED and SMD_NONSTOP_FAILING take a colon-separated list built from:
- actions: REPLACE, DROP, EXIT_JOB
- parameters: TIME_LIMIT_DELAY, TIME_LIMIT_EXTEND, TIME_LIMIT_DROP
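A hedged sketch combining both variables, following the syntax of the script on the previous slide (the FAILING setting and its semantics are an assumption, not shown on the original slides):

# Failed nodes: try to replace them, waiting up to 10 minutes for a spare;
# give up and terminate the job if no spare arrives.
export SMD_NONSTOP_FAILED=REPLACE:TIME_LIMIT_DELAY=10:EXIT_JOB
# Failing (but not yet failed) nodes: drop them from the allocation
# and keep running on the remaining nodes.
export SMD_NONSTOP_FAILING=DROP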

nonstop.conf
# ControlAddr=prometeo
# Debug=0
Port=34000
UserDrainAllow=david
#UserDrainDeny=david
#
# Extend time upon node replacement
TimeLimitExtend=15
# Extend time upon node drop
TimeLimitDrop=22
# Extend time while attempting to replace node
TimeLimitDelay=12
# HotSpareCount=bootes:2
# MaxSpareNodeCount=2

Example of use
Termination of a component of the parallel job causes the step to abort. If there are enough nodes, the replacement is automatic.
srun: error: achab5: tasks 16-23: Killed
is_failed: job 130 searching for FAILED hosts
is_failed: job 130 has 1 FAILED node(s)
is_failed: job 130 FAILED node achab6 cpu_count 8
_handle_fault: job 130 handle failed_hosts
_try_replace: job 130 trying to replace 1 node(s)
_try_replace: job 130 node achab6 replaced by achab7
_generate_node_file: job 130 all nodes replaced
source the /tmp/smd_job_130_nodes.sh hostfile to get the new job environment
_try_replace: job 130 all nodes replaced all right
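As the message suggests, a job script can pick up the new node list by sourcing the generated hostfile before relaunching its step. A hedged sketch (generalizing the file name with $SLURM_JOB_ID is an assumption based on the path shown above):

# Refresh the job environment after a successful node replacement.
hostfile=/tmp/smd_job_${SLURM_JOB_ID}_nodes.sh
if [ -f "$hostfile" ]; then
    . "$hostfile"   # exports the updated node list for subsequent srun calls
fi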

Discussion
Questions and answers. Thank you!