SLURM Workload Manager
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is the native scheduler software that runs on ASTI's HPC cluster. It is a free and open-source job scheduler for Linux, used by many of the world's supercomputers and computer clusters (on the November 2013 Top500 list, five of the ten top systems use Slurm). Users request allocation of compute resources through SLURM.

Slurm has three key functions:
- It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
- It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
- It arbitrates contention for resources by managing a queue of pending work.
SLURM Entities
- Nodes: compute resources managed by SLURM
- Partitions: logical sets of nodes with the same queue parameters (job size limit, job time limit, users permitted to use it, etc.)
- Jobs: allocations of resources assigned to a user for a specified amount of time
- Job steps: sets of (possibly parallel) tasks within a job
http://slurm.schedmd.com/entities.gif
Types of Jobs
- Multi-node parallel jobs
  - Use more than one node and require MPI to communicate between nodes
  - Usually require more computing resources (cores) than a single node can offer
- Single-node parallel jobs
  - Use only one node, but multiple cores on that node
  - Includes pthreads, OpenMP, and shared-memory MPI
- Truly serial jobs
  - Require only one core on one node
SLURM Partitions
CoARE's HPC cluster is broken up into two (2) separate partitions:

batch
- Suitable for jobs that take a long time to finish (<= 7 days)
- Up to six (6) nodes may be allocated to any single job
- Each job can allocate up to 4 GB of memory per CPU core
- Default partition when the partition directive is unfilled in a request

debug
- Queue for small/short jobs
- Maximum run time limit per job is 60 minutes (1 hour)
- Best for interactive usage (e.g. compiling, debugging)
Job Limits
Every job submitted by users to SLURM is subject to limits (quota):
- Users can request up to 168 hours (1 week, 7 days) for a single job
- Users can request up to 288 CPU cores (for one job or spread across multiple jobs)
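Expressed as #SBATCH directives, a job header requesting these maximums might look like the sketch below. The job name is a placeholder; only the limits themselves come from the slide above.

```shell
#!/bin/bash
# Hypothetical header requesting the cluster-wide maximums described above
#SBATCH --partition=batch
#SBATCH --time=7-00:00:00        # 168 hours, the per-job wall-time limit
#SBATCH --ntasks=288             # the full per-user CPU core quota
#SBATCH --job-name=max_quota_job # placeholder name
```

Note that a single 288-core job consumes the entire per-user quota, so no other job from the same user would start until it finishes.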
SLURM Job Script Part 1

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --job-name="hello_test"
#SBATCH --output=test-srun.out
#SBATCH --mail-user=bert@asti.dost.gov.ph
#SBATCH --mail-type=all
#SBATCH --requeue
SLURM Job Script Part 2

# Print environment variables related to SLURM; useful for debugging
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
echo "SLURM_NNODES="$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
echo "working directory = "$SLURM_SUBMIT_DIR

# Load modules
module load intel/compiler/14.0.3
module load mvapich2/2.1-intel

# Print currently loaded module list
module list

# Set stack size to unlimited
ulimit -s unlimited
SLURM Job Script Part 3

# Launch your application
# Replace $NUM with the number of processors
# This example executes a job step
srun /some/program
srun -n $NUM /some/program
Job Management
- Check the status of nodes in the cluster
    sinfo
- Submit the job script to the queue
    sbatch name_of_file.slurm
- Check the status of the job
    squeue -u <user>
    scontrol show jobs
- Cancel a job
    scancel ${JOB_ID}
Job Management
Check the status of nodes in the cluster

$ sinfo
$ scontrol show nodes

- Provides information on the state of the nodes in the cluster (idle, mix, down)
- Shows the partition names, node allocation and availability
- Consult the sinfo man pages for more info on usage and options
Job Management
Submit the job script to the queue

$ sbatch name_of_file.slurm
$ sbatch -p debug name_of_file.slurm

- sbatch submits a batch script to SLURM
- The batch script contains options preceded by #SBATCH, placed before any executable commands
- Options specified on the sbatch command line override the options set within the batch file
- Consult the sbatch man pages for more info on usage and options
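Because command-line options win over #SBATCH lines, the same script can be redirected without editing it. A sketch (my_job.slurm is a placeholder name for your batch script; the 30-minute cap is an illustrative value):

```shell
sbatch my_job.slurm                            # uses the #SBATCH options as written
sbatch -p debug my_job.slurm                   # command line overrides --partition
sbatch -p debug --time=00:30:00 my_job.slurm   # also caps the run time at 30 minutes
```

This is convenient for testing a production script on the debug partition before submitting it to batch.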
Interactive Jobs
- Helpful for program debugging and running many short jobs
- Commands can be run through the terminal directly on the compute nodes

$ salloc -p debug -n 1
$ env | grep SLURM
$ module load intel/compilers/14.0
$ srun hostname
$ exit
Job Arrays
- The basic idea is to write a single script that functions as a template for all the jobs that need to be run
- The template can be passed a set of inputs and will process each one as a separate job
- Best fit for jobs that are independently parallel (no communication or dependency between jobs)
Sample Script

#!/bin/bash
#SBATCH --job-name=jobarray
#SBATCH --array=1-100
#SBATCH --ntasks=1

# Pick the Nth file in inputs/ for the Nth array task
filename=`ls inputs/ | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`

# use that file as an input
/some/app < inputs/$filename
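The `ls | tail | head` pipeline above can be tried outside SLURM by setting SLURM_ARRAY_TASK_ID by hand. In this sketch the inputs/ directory and the sample file names are made up for illustration; SLURM itself sets the variable per array task:

```shell
# Simulate how the array script maps SLURM_ARRAY_TASK_ID to an input file.
mkdir -p /tmp/jobarray_demo/inputs
cd /tmp/jobarray_demo
touch inputs/sample_a.dat inputs/sample_b.dat inputs/sample_c.dat

# SLURM sets this per array task; we set it by hand to mimic task 2
SLURM_ARRAY_TASK_ID=2

# Same pipeline as the job script: skip to the Nth entry, keep one
filename=$(ls inputs/ | tail -n +${SLURM_ARRAY_TASK_ID} | head -1)
echo "$filename"   # prints "sample_b.dat"
```

Since `ls` sorts names alphabetically, task 1 gets sample_a.dat, task 2 gets sample_b.dat, and so on; the --array range should therefore match the number of input files.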
Use Cases
Executing Serial Jobs

#SBATCH --partition=batch
#SBATCH --nodes=1
Use Cases
Executing Parallel Jobs on One Node

#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

or

#SBATCH --ntasks=8
Use Cases
Executing Parallel Jobs on Multiple Nodes
A job requires 128 cores spread across multiple nodes:

#SBATCH --partition=batch
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=48
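The node count in the example is a ceiling division: at 48 cores per node, 128 cores need 3 nodes (and requesting 48 tasks on each of the 3 nodes actually yields 144 task slots). A quick sketch of the arithmetic; the per-node figure is taken from the directives above, adjust it for your cluster:

```shell
# Ceiling division: minimum nodes needed for a given core count
cores_needed=128
cores_per_node=48
nodes=$(( (cores_needed + cores_per_node - 1) / cores_per_node ))
echo "$nodes"   # prints "3"
```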
Exercise
1. Download the Intel Optimized LINPACK Benchmark:
   http://registrationcenter.intel.com/irc_nas/7615/l_lpk_p_11.3.0.004.tgz
2. Extract the package and put the contents into ~/scratch2/intel
3. Download the sample SLURM job script:
   http://ftp.pregi.net/pub/hpc-training/sample.slurm
4. Delete all SLURM directives except for the lines specifying partitions and/or nodes.
5. Submit a job requesting only 1 processor. The job must execute the program xlinpack_xeon64.
6. Check the status of the job using squeue. Determine the node where the program is running.
7. To validate that the program has indeed executed, ssh into the node and list all your processes using `ps ax`.
Exercise
1. Download HPL:
   http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
2. Untar the contents into $SCRATCH2
3. Change directory into $SCRATCH2/hpl-2.1 and obtain the Makefile that will be used in the compilation process:
   http://ftp.pregi.net/pub/hpc-training/make.linux_intel64_mkl
4. Run s