Using the SLURM Job Scheduler
[web] portal.biohpc.swmed.edu
[email] biohpc-help@utsouthwestern.edu
Updated for 2015-02-11
Overview
Today we're going to cover:
- The basics of SLURM's architecture and daemons
- How to use a basic set of commands
- How to write an sbatch script for job submission
This is only an introduction, but it should give you a good start.
What is SLURM?
Simple Linux Utility for Resource Management
- Started as a simple resource manager for Linux clusters
- About 500,000 lines of C code today
The glue that lets a parallel computer execute parallel jobs
- Makes a parallel computer almost as easy to use as a PC
- Programs typically use MPI to manage communication within the parallel job
Cluster Architecture (diagram: see https://computing.llnl.gov/linux/slurm/overview.html)
BioHPC (current system and partitions)
Storage systems (Lysosome & Endosome): /home2, /project, /work
Compute nodes (partitions):
- 64GB* : Nucleus010-025
- 128GB : Nucleus026-041
- 256GB : Nucleus042-049 (GPU: Nucleus048-049)
- 384GB : Nucleus050
- super : Nucleus010-050
Log in via SSH to nucleus.biohpc.swmed.edu (Nucleus005, the login node) and submit jobs via SLURM.
BioHPC (future partitions)
- 64GB* : 8 nodes -> 0 nodes
- 128GB : 16 nodes -> 32 nodes
- 256GB : 8 nodes -> 32 nodes
- 384GB : 1 node -> 2 nodes
- GPU : 2 nodes -> 8 nodes
- super : 41 nodes -> 66 nodes
- total : 41 nodes -> 74 nodes
Role of SLURM (resource management & job scheduling)
Fair-share resource allocation
Allocates resources within a cluster
- Supports hierarchical communication with a configurable format
- Each node has 2 sockets, 8 cores per socket, and 2 threads per core, so the total number of parallel tasks available on a single node is 2 * 8 * 2 = 32
When there is more work than resources, SLURM manages queues of work
- Request resources based on your needs
- Resource limits can be set per queue, per user, etc.
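You can check a node's layout yourself; scontrol reports the socket/core/thread counts for any node (standard SLURM command; the node name here just follows the naming on the surrounding slides):

>scontrol show node nucleus010    (look for the Sockets, CoresPerSocket, and ThreadsPerCore fields)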
SLURM commands
- Before job submission : sinfo, squeue, sview, smap
- Submitting a job : sbatch, srun, salloc, sattach
- While a job is running : squeue, sview, scontrol
- After a job completes : sacct
Commands: General Information
Man pages are available for all commands
The --help option prints brief descriptions of all options
The --usage option prints a list of the options
Almost all options have two formats:
- A single-letter option (e.g. -p super)
- A verbose option (e.g. --partition=super)
sinfo (DEMO)
Reports the status of nodes and partitions (partition-oriented format by default)
>sinfo --Node    (report status in node-oriented form)
>sinfo -p 256GB    (report status of nodes in partition 256GB)
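A few more sinfo variants that are often useful (standard SLURM flags, nothing BioHPC-specific):

>sinfo -N -l              (node-oriented, long format: CPUs, memory, state per node)
>sinfo -p super -t idle   (list only idle nodes in partition super)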
Life cycle of a job after submission
Submit a job -> Pending -> Configuring (node booting) -> Running -> Completing, ending in one of:
- Completed (zero exit code)
- Cancelled (via scancel)
- Failed (non-zero exit code)
- Timeout (time limit reached)
A running job may also pass through Suspended or Resizing along the way.
* BioHPC policy: 16 running nodes per user
Demo: submit a job, then use squeue and scontrol to check it
>squeue
>scontrol show job {jobid}
Demo: cancel a job with scancel, then use sacct to check previous jobs
>scancel {jobid}
>sacct -u {username}
>sacct -j {jobid}
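sacct can also report selected fields per job; a handy variant using its standard --format option (the field names are standard sacct fields):

>sacct -j {jobid} --format=JobID,JobName,Partition,State,ExitCode,Elapsed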
List of SLURM commands
- sbatch : Submit a script for later execution (batch mode)
- salloc : Create a job allocation and start a shell to use it (interactive mode)
- srun : Create a job allocation (if needed) and launch a job step (typically an MPI job)
- sattach : Connect stdin/stdout/stderr to an existing job or job step
- squeue : Report job and job step status
- smap : Report system, job, or step status with topology; less functionality than sview
- sview : Report/update system, job, step, partition, or reservation status (GTK-based GUI)
- scontrol : Administrator tool to view/update system, job, step, partition, or reservation status
- sacct : Report accounting information by individual job and job step
List of valid job states
- PENDING (PD) : Job is awaiting resource allocation
- RUNNING (R) : Job currently has an allocation
- SUSPENDED (S) : Job has an allocation, but execution has been suspended
- COMPLETING (CG) : Job is in the process of completing
- COMPLETED (CD) : Job has terminated all processes on all nodes
- CONFIGURING (CF) : Job has been allocated resources, waiting for them to become ready for use
- CANCELLED (CA) : Job was explicitly cancelled by the user or a system administrator
- FAILED (F) : Job terminated with a non-zero exit code or other failure condition
- TIMEOUT (TO) : Job terminated upon reaching its time limit
- NODE_FAIL (NF) : Job terminated due to failure of one or more allocated nodes
Demo 01 : submit a single job to a single node

#!/bin/bash
# set up the SLURM environment
#SBATCH --job-name=singlematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30            # format: D-H:M:S
#SBATCH --output=single.%j.out
#SBATCH --error=single.%j.time

module add matlab/2013a              # load software (export path & library)

# command(s) you want to execute
time(matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(1), exit")

Alternatively, you may use: matlab -nodisplay -nodesktop -nosplash < script.m
Running time: ~3.046 s
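A typical submit-and-check cycle for this script might look like the following (assuming the script above was saved as demo01.sh; the filename is illustrative):

>sbatch demo01.sh
>squeue -u $USER          (watch the job until it leaves the queue)
>cat single.{jobid}.out   (%j in the output name is replaced by the job ID)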
More sbatch options
#SBATCH --begin=now+1hour
  Defer the allocation of the job until the specified time
#SBATCH --mail-type=ALL
  Notify the user by email when certain event types occur (BEGIN, END, FAIL, REQUEUE, etc.)
#SBATCH --mail-user=yi.du@utsouthwestern.edu
  Address to receive email notifications of state changes as defined by --mail-type
#SBATCH --mem=65536
  Specify the real memory required per node in megabytes; the default is DefMemPerNode (UNLIMITED)
#SBATCH --nodelist=nucleus0[10-20]
  Request a specific list of node names. The order of the names does not matter; SLURM sorts them
Demo 02 : submit multiple tasks to a single node (sequential tasks)

#!/bin/bash
#SBATCH --job-name=multimatlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplejob.%j.out
#SBATCH --error=smultiplejob.%j.time

module add matlab/2013a

time(sh script.m)

Here script.m holds the list of tasks to run one after another (see the sketch below).
Running time: ~44.904 s
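The slide does not reproduce script.m itself; a minimal sketch of what a sequential task list could look like, reusing the test function from Demo 01 (the indices are illustrative):

# script.m (sketch): each task runs only after the previous one finishes
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(1), exit"
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(2), exit"
# ... and so on for the remaining tasks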
Demo 03 : submit multiple tasks to a single node (simultaneous tasks), version 1

#!/bin/bash
#SBATCH --job-name=multicorematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecore.%j.out
#SBATCH --error=smultiplecore.%j.time

module add matlab/2013a

time(sh script.m)

Here script.m launches the tasks simultaneously instead of sequentially (see the sketch below).
Running time: ~6.156 s
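Again, script.m is not shown on the slide; the usual shell idiom for running the same tasks simultaneously is to background each command and wait for all of them (a sketch under that assumption):

# script.m (sketch): & backgrounds each task; wait blocks until all of them finish
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(1), exit" &
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(2), exit" &
# ... and so on for the remaining tasks
wait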
Demo 04 : submit multiple tasks to a single node (simultaneous tasks), version 2

#!/bin/bash
#SBATCH --job-name=multi_matlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecoreadv.%j.out
#SBATCH --error=smultiplecoreadv.%j.time

module add matlab/2013a

time(sh script.m)

Running time: ~4.904 s
Demo 05 : submit multiple tasks to a single node with srun

#!/bin/bash
#SBATCH --job-name=srunsinglenodematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srunsinglenode.%j.out
#SBATCH --error=srunsinglenode.%j.time

module add matlab/2013a

time(sh script.m)

srun creates a resource allocation (if needed) and runs the parallel job within it.
SLURM_LOCALID : environment variable holding the node-local task ID of the process within a job (zero-based); script.m uses it to pick each task (see the sketch below).
Running time: ~4.596 s
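script.m is again not reproduced; with srun, one common pattern is to launch one wrapper per task and let each wrapper read its own SLURM_LOCALID (a sketch, not the slide's exact script):

# script.m (sketch): srun starts --ntasks copies; each picks its task from SLURM_LOCALID
srun bash -c 'matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($((SLURM_LOCALID+1))), exit"'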
Why srun?
(Diagram: the srun launch sequence between srun, slurmctld, slurmd, and slurmstepd, ending with tasks placed on CPUs 0-15 of Node 0.)
srun creates a resource allocation to run the parallel job.
SLURM_LOCALID : environment variable; node-local task ID for the process within a job (zero-based).
Demo 06 : submit multiple tasks to multiple nodes with srun

#!/bin/bash
#SBATCH --job-name=srun2nodesmatlab
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srun2nodes.%j.out
#SBATCH --error=srun2nodes.%j.time

module add matlab/2013a

time(sh script.m)

SLURM_NODEID : the relative node ID of the current node (zero-based)
SLURM_NNODES : total number of nodes in the job's resource allocation
SLURM_NTASKS : total number of processes in the current job
Running time: ~4.741 s
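Across two nodes, the local task ID alone is no longer unique, so a global index has to be derived from the variables above (a sketch; SLURM also provides SLURM_PROCID, which holds this global rank directly):

# script.m (sketch): global index = node ID * tasks-per-node + local ID
srun bash -c 'INDEX=$(( SLURM_NODEID * (SLURM_NTASKS / SLURM_NNODES) + SLURM_LOCALID + 1 )); matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($INDEX), exit"'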
Demo 06 results: matrices 1-8 were executed on Nucleus015 and matrices 9-16 on Nucleus016. Within each node, task execution is out of order.
Demo 07 : submit a job with a dependency
>sbatch test_srun2nodes.sh
>sbatch --depend={jobid} gen_gif.sh
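The dependency flag also accepts a condition; afterok (start only if the named job succeeded) is the usual choice for pipelines like this one (standard sbatch dependency syntax):

>sbatch --dependency=afterok:{jobid} gen_gif.sh    (run only if job {jobid} completed successfully)
>sbatch --dependency=afterany:{jobid} gen_gif.sh   (run once job {jobid} finishes, regardless of exit code)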
Parallelize your job at the thread level
- Shared memory: multiple processors operate independently but share the same memory (multi-threading)
- Fast, but lower scalability (cannot cross node boundaries)
e.g. POSIX Threads, usually referred to as Pthreads
- Implemented with a <pthread.h> header and a thread library
- For the C programming language
- Thread management: creating and joining threads, etc.
Demo 08 : submit a multi-core job to a single node

#!/bin/bash
#SBATCH --job-name=phenix
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=30                  # does not necessarily need to be specified here
#SBATCH --time=0-20:00:00
#SBATCH --output=phenix.%j.out
#SBATCH --error=phenix.%j.time

module add phenix/1.9

phenix.den_refine model.pdb data.mtz nproc=30

Think about your job from a scientific perspective: can the problem be solved in parallel?
Check with the software or read its manual to see whether it can use multithreading, and try to use appropriate parameters.
If you need help, send email to biohpc-help@utsouthwestern.edu
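To keep the thread count in step with the allocation instead of hard-coding it, the standard SLURM_CPUS_ON_NODE variable can be used (a sketch; check the phenix manual for its own limits on nproc):

phenix.den_refine model.pdb data.mtz nproc=$SLURM_CPUS_ON_NODE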
Parallelize your job at the node level
- Distributed memory: processors have their own local memory
- Memory is scalable with the number of processors
Q: How does a processor access the memory of another node?
e.g. Message Passing Interface (MPI)
- Scales to over 100k processors
- Distributed
- A library of functions (for C, C++, Fortran), not a new language
Demo 09 : submit a multi-node job with message passing

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_test.%j.out

module add mvapich2/gcc/1.9

mpirun ./mpi_only

Keep in mind:
- total tasks / number of nodes <= 32
- memory needed per task * tasks per node <= 64GB/128GB/256GB/384GB (depending on partition)
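When the per-node limits above matter, it is often clearer to pin the task distribution explicitly with --ntasks-per-node instead of letting SLURM spread --ntasks (standard sbatch option; the numbers are illustrative):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # 2 nodes x 4 tasks = 8 MPI tasks, exactly 4 per node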
Demo 10 : submit a multi-node job (e.g. MPI + Pthreads)

#!/bin/bash
#SBATCH --job-name=mpi_pthread
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_pthread.%j.out

module add mvapich2/gcc/1.9

NUM_THREADS=$(( SLURM_CPUS_ON_NODE / (SLURM_NTASKS / SLURM_NNODES) ))
mpirun ./mpi_pthread $NUM_THREADS

Keep in mind:
- total MPI tasks * threads per task / number of nodes <= 32
- memory needed per MPI task * MPI tasks per node <= 64GB/128GB/256GB/384GB (depending on partition)
A similar strategy works for MPI-OpenMP code: set OMP_NUM_THREADS=$NUM_THREADS (see the sketch below).
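A sketch of the MPI-OpenMP variant mentioned above (./mpi_openmp is a placeholder binary name, not from the slide):

NUM_THREADS=$(( SLURM_CPUS_ON_NODE / (SLURM_NTASKS / SLURM_NNODES) ))
export OMP_NUM_THREADS=$NUM_THREADS    # OpenMP runtimes read this standard variable
mpirun ./mpi_openmp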
Demo 11 : submit a GPU job (e.g. CUDA)

#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=0-00:10:00
#SBATCH --output=cuda.%j.out
#SBATCH --error=cuda.%j.err

module add cuda65

./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320

Jobs will not be allocated any generic resources unless specifically requested at submit time with the --gres option, which is supported by sbatch and srun. Format: --gres=gpu:[n], where n is the number of GPUs. Use the GPU partition.
The example multiplies matrices A(320, 10240) and B(10240, 320).
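Before launching a long GPU job, a quick interactive check confirms that the allocation actually exposes a GPU (standard commands; nvidia-smi lists the GPUs visible to the job):

>srun --partition=gpu --gres=gpu:1 nvidia-smi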
Getting Effective Help
Email the ticket system: biohpc-help@utsouthwestern.edu
- What is the problem? Provide any error messages and diagnostic output you have.
- When did it happen? What time? Cluster or client? Which job ID?
- How did you run it? What did you run, with what parameters, and what do they mean?
- Any unusual circumstances? Have you compiled your own software? Do you customize startup scripts?
- Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.