SLURM Workload Manager
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is the native scheduler software that runs on ASTI's HPC cluster. It is a free and open-source job scheduler for Linux, used by many of the world's supercomputers and computer clusters (on the November 2013 Top500 list, five of the ten top systems use Slurm). Users request allocation of compute resources through SLURM.

Slurm has three key functions:
- It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
- It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
- It arbitrates contention for resources by managing a queue of pending work.
SLURM Entities
- Nodes: compute resources managed by SLURM
- Partitions: logical sets of nodes with the same queue parameters (job size limit, job time limit, users permitted to use it, etc.)
- Jobs: allocations of resources assigned to a user for a specified amount of time
- Job steps: sets of (possibly parallel) tasks within a job
http://slurm.schedmd.com/entities.gif
Types of Jobs
- Multi-node parallel jobs
  - Use more than one node and require MPI to communicate between nodes
  - Usually require more computing resources (cores) than a single node can offer
- Single-node parallel jobs
  - Use only one node, but multiple cores on that node
  - Includes pthreads, OpenMP, and shared-memory MPI
- Truly serial jobs
  - Require only one core on one node
SLURM Partitions
CoARE's HPC cluster is broken up into two (2) separate partitions:

batch
- Suitable for jobs that take a long time to finish (<= 7 days)
- Up to six (6) nodes may be allocated to any single job
- Each job can allocate up to 4 GB of memory per CPU core
- Default partition when the partition directive is unfilled in a request

debug
- Queue for small/short jobs
- Maximum run time limit per job is 60 minutes (1 hour)
- Best for interactive usage (e.g. compiling, debugging)
Job Limits
Every job submitted by users to SLURM is subject to limits (quota):
- Users can request up to 168 hours (1 week, 7 days) for a single job
- Users can request up to 288 CPU cores (for one job or spread across multiple jobs)
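Expressed as #SBATCH directives, a job header requesting these maximums might look like the sketch below. The job name is a placeholder; only the limits themselves come from the slide above.

```shell
#!/bin/bash
# Hypothetical header requesting the cluster-wide maximums described above
#SBATCH --partition=batch
#SBATCH --time=7-00:00:00        # 168 hours, the per-job wall-time limit
#SBATCH --ntasks=288             # the full per-user CPU core quota
#SBATCH --job-name=max_quota_job # placeholder name
```

Note that a single 288-core job consumes the entire per-user quota, so no other job from the same user would start until it finishes.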
SLURM Job Script Part 1

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --job-name="hello_test"
#SBATCH --output=test-srun.out
#SBATCH --mail-user=bert@asti.dost.gov.ph
#SBATCH --mail-type=all
#SBATCH --requeue
SLURM Job Script Part 2

# Print environment variables related to SLURM; useful for debugging
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
echo "SLURM_NNODES="$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
echo "working directory = "$SLURM_SUBMIT_DIR

# Load modules
module load intel/compiler/14.0.3
module load mvapich2/2.1-intel

# Print currently loaded module list
module list

# Set stack size to unlimited
ulimit -s unlimited
SLURM Job Script Part 3

# Launch your application
# Replace $NUM with the number of processors
# This example executes a job step
srun /some/program
srun -n $NUM /some/program
Job Management
- Check the status of nodes in the cluster
    sinfo
- Submit the job script to the queue
    sbatch name_of_file.slurm
- Check the status of the job
    squeue -u <user>
    scontrol show jobs
- Cancel a job
    scancel ${JOB_ID}
Job Management
Check the status of nodes in the cluster

$ sinfo
$ scontrol show nodes

- Provides information on the state of the nodes in the cluster (idle, mix, down)
- Shows the partition names, node allocation and availability
- Consult the sinfo man pages for more info on usage and options
Job Management
Submit the job script to the queue

$ sbatch name_of_file.slurm
$ sbatch -p debug name_of_file.slurm

- sbatch submits a batch script to SLURM
- The batch script contains options preceded by #SBATCH, placed before any executable commands
- Options specified on the sbatch command line override the options set within the batch file
- Consult the sbatch man pages for more info on usage and options
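Because command-line options win over #SBATCH lines, the same script can be redirected without editing it. A sketch (my_job.slurm is a placeholder name for your batch script; the 30-minute cap is an illustrative value):

```shell
sbatch my_job.slurm                            # uses the #SBATCH options as written
sbatch -p debug my_job.slurm                   # command line overrides --partition
sbatch -p debug --time=00:30:00 my_job.slurm   # also caps the run time at 30 minutes
```

This is convenient for testing a production script on the debug partition before submitting it to batch.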
Interactive Jobs
- Helpful for program debugging and running many short jobs
- Commands can be run through the terminal directly on the compute nodes

$ salloc -p debug -n 1
$ env | grep SLURM
$ module load intel/compilers/14.0
$ srun hostname
$ exit
Job Arrays
- The basic idea is to write a single script that functions as a template for all the jobs that need to be run
- The template can be passed a set of inputs and will process each one as a separate job
- Best fit for jobs that are independently parallel (no communication or dependency between jobs)
Sample Script

#!/bin/bash
#SBATCH --job-name=jobarray
#SBATCH --array=1-100
#SBATCH --ntasks=1

# Pick the Nth file in inputs/ for the Nth array task
filename=`ls inputs/ | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`

# use that file as an input
/some/app < inputs/$filename
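The `ls | tail | head` pipeline above can be tried outside SLURM by setting SLURM_ARRAY_TASK_ID by hand. In this sketch the inputs/ directory and the sample file names are made up for illustration; SLURM itself sets the variable per array task:

```shell
# Simulate how the array script maps SLURM_ARRAY_TASK_ID to an input file.
mkdir -p /tmp/jobarray_demo/inputs
cd /tmp/jobarray_demo
touch inputs/sample_a.dat inputs/sample_b.dat inputs/sample_c.dat

# SLURM sets this per array task; we set it by hand to mimic task 2
SLURM_ARRAY_TASK_ID=2

# Same pipeline as the job script: skip to the Nth entry, keep one
filename=$(ls inputs/ | tail -n +${SLURM_ARRAY_TASK_ID} | head -1)
echo "$filename"   # prints "sample_b.dat"
```

Since `ls` sorts names alphabetically, task 1 gets sample_a.dat, task 2 gets sample_b.dat, and so on; the --array range should therefore match the number of input files.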
Use Cases
Executing Serial Jobs

#SBATCH --partition=batch
#SBATCH --nodes=1
Use Cases
Executing Parallel Jobs on One Node

#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

or

#SBATCH --ntasks=8
Use Cases
Executing Parallel Jobs on Multiple Nodes
A job requires 128 cores spread across multiple nodes:

#SBATCH --partition=batch
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=48
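The node count in the example is a ceiling division: at 48 cores per node, 128 cores need 3 nodes (and requesting 48 tasks on each of the 3 nodes actually yields 144 task slots). A quick sketch of the arithmetic; the per-node figure is taken from the directives above, adjust it for your cluster:

```shell
# Ceiling division: minimum nodes needed for a given core count
cores_needed=128
cores_per_node=48
nodes=$(( (cores_needed + cores_per_node - 1) / cores_per_node ))
echo "$nodes"   # prints "3"
```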
Exercise
1. Download the Intel Optimized LINPACK Benchmark:
   http://registrationcenter.intel.com/irc_nas/7615/l_lpk_p_11.3.0.004.tgz
2. Extract the package and put the contents into ~/scratch2/intel
3. Download the sample SLURM job script:
   http://ftp.pregi.net/pub/hpc-training/sample.slurm
4. Delete all SLURM directives except for the lines specifying partitions and/or nodes.
5. Submit a job requesting only 1 processor. The job must execute the program xlinpack_xeon64.
6. Check the status of the job using squeue. Determine the node where the program is running.
7. To validate that the program has indeed executed, ssh into the node and list all your processes using `ps ax`.
Exercise
1. Download HPL:
   http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
2. Untar the contents into $SCRATCH2
3. Change directory into $SCRATCH2/hpl-2.1 and obtain the Makefile that will be used in the compilation process:
   http://ftp.pregi.net/pub/hpc-training/make.linux_intel64_mkl
4. Run s