Using the SLURM Job Scheduler

[web] portal.biohpc.swmed.edu  [email] biohpc-help@utsouthwestern.edu  Updated for 2015-02-11

Overview
Today we're going to cover:
- The basics of SLURM's architecture and daemons
- How to use a basic set of commands
- How to write an sbatch script for job submission
This is only an introduction, but it should give you a good start.

What is SLURM?
Simple Linux Utility for Resource Management
- Started as a simple resource manager for Linux clusters
- About 500,000 lines of C code today
The glue that lets a parallel computer execute parallel jobs
- Makes a parallel computer almost as easy to use as a PC
- Parallel programs typically use MPI to manage communications within the program

Cluster Architecture (diagram from https://computing.llnl.gov/linux/slurm/overview.html)

BioHPC (current system and partitions)
Storage systems (Lysosome & Endosome): /home2, /project, /work
Login via SSH to nucleus.biohpc.swmed.edu (Nucleus005, the login node), then submit jobs via SLURM.
Compute node partitions:
- 64GB* : Nucleus010-025
- 128GB : Nucleus026-041
- 256GB : Nucleus042-049 (GPU: Nucleus048-049)
- 384GB : Nucleus050
- super : Nucleus010-050

BioHPC (future partitions)
- 64GB* : 8 nodes -> 0 nodes
- 128GB : 16 nodes -> 32 nodes
- 256GB : 8 nodes -> 32 nodes
- 384GB : 1 node -> 2 nodes
- super : 41 nodes -> 66 nodes
- GPU : 2 nodes -> 8 nodes
- total : 41 nodes -> 74 nodes

Role of SLURM (resource management & job scheduling)
Allocates resources within a cluster, with fair-share resource allocation
- Supports hierarchical communications with a configurable format
- On each node, sockets : cores per socket : threads per core = 2 : 8 : 2, so the total number of parallel tasks available on a single node = 2*8*2 = 32
When there is more work than resources, SLURM manages queues of work
- Request resources based on your needs
- Resource limits can be set per queue, per user, etc.

SLURM commands
- Before job submission : sinfo, squeue, sview, smap
- Submit a job : sbatch, srun, salloc, sattach
- While the job is running : squeue, sview, scontrol
- After the job has completed : sacct

Commands: General Information
- Man pages are available for all commands
- The --help option prints brief descriptions of all options
- The --usage option prints a list of the options
- Almost all options have two formats: a single-letter option (e.g. -p super) and a verbose option (e.g. --partition=super)

sinfo (DEMO)
Reports the status of nodes and partitions (partition-oriented format by default)
> sinfo --Node (report status in node-oriented form)
> sinfo -p 256GB (report status of nodes in the 256GB partition)
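A few more common variations (these are standard sinfo options; the partition names are the BioHPC ones listed earlier):
> sinfo (summary of every partition)
> sinfo --Node (one line per node)
> sinfo -p super -t idle (only idle nodes in the super partition)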

Life cycle of a job after submission
Submit a job -> Pending -> Configuration (node booting) -> Running (possibly Resizing or Suspended along the way) -> Completing
Terminal states: Completed (zero exit code), Failed (non-zero exit code), Cancelled (scancel), Time out (time limit reached)
* BioHPC policy: 16 running nodes per user

Demo: submit a job, then use squeue and scontrol to check it
> squeue
> scontrol show job {jobid}
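A couple of common variations for reference (standard options; the job ID 12345 is just a placeholder):
> squeue -u {username} (show only that user's jobs)
> scontrol show job 12345 (full details: nodes, resources, submit/start times, working directory)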

Demo: cancel a job with scancel, then use sacct to check previous jobs
> scancel {jobid}
> sacct -u {username}
> sacct -j {jobid}
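A slightly richer sacct call can also be useful (standard options; the field names are standard sacct fields):
> sacct -j {jobid} --format=JobID,JobName,Partition,State,Elapsed,MaxRSS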

List of SLURM commands
- sbatch : Submit a script for later execution (batch mode)
- salloc : Create a job allocation and start a shell to use it (interactive mode)
- srun : Create a job allocation (if needed) and launch a job step (typically an MPI job)
- sattach : Connect stdin/stdout/stderr to an existing job or job step
- squeue : Report job and job step status
- smap : Report system, job or step status with topology; less functionality than sview
- sview : Report/update system, job, step, partition or reservation status (GTK-based GUI)
- scontrol : Administrator tool to view/update system, job, step, partition or reservation status
- sacct : Report accounting information by individual job and job step

List of valid job states
- PENDING (PD) : Job is awaiting resource allocation
- RUNNING (R) : Job currently has an allocation
- SUSPENDED (S) : Job has an allocation, but execution has been suspended
- COMPLETING (CG) : Job is in the process of completing
- COMPLETED (CD) : Job has terminated all processes on all nodes
- CONFIGURING (CF) : Job has been allocated resources and is waiting for them to become ready for use
- CANCELLED (CA) : Job was explicitly cancelled by the user or a system administrator
- FAILED (F) : Job terminated with a non-zero exit code or other failure condition
- TIMEOUT (TO) : Job terminated upon reaching its time limit
- NODE_FAIL (NF) : Job terminated due to failure of one or more allocated nodes

Demo 01 : submit a single job to a single node

#!/bin/bash
#SBATCH --job-name=singlematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30          # time format: D-H:M:S
#SBATCH --output=single.%j.out
#SBATCH --error=single.%j.time     # the #SBATCH lines set up the SLURM environment

module add matlab/2013a            # load software (export path & library)

# command(s) you want to execute
time(matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(1), exit")

Alternatively you may use: matlab -nodisplay -nodesktop -nosplash < script.m
Running time: ~3.046 s
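To run this demo, the script would be saved to a file (the name demo01.sh below is just a placeholder) and submitted with sbatch; %j in --output/--error is replaced by the job ID:
> sbatch demo01.sh (prints: Submitted batch job {jobid})
> squeue -u {username} (watch the job while it runs)
> cat single.{jobid}.out (inspect the results afterwards)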

More sbatch options
#SBATCH --begin=now+1hour : defer the allocation of the job until the specified time
#SBATCH --mail-type=ALL : notify the user by email when certain event types occur (BEGIN, END, FAIL, REQUEUE, etc.)
#SBATCH --mail-user=yi.du@utsouthwestern.edu : address used to receive email notifications of state changes as defined by --mail-type
#SBATCH --mem=65536 : specify the real memory required per node, in megabytes; the default value is DefMemPerNode (UNLIMITED)
#SBATCH --nodelist=nucleus0[10-20] : request a specific list of node names; the order of the names is not important, as they will be sorted by SLURM
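As a sketch, these options can simply be added to the header of any of the demo scripts, for example (values taken from the examples above):

#!/bin/bash
#SBATCH --job-name=singlematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --begin=now+1hour
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yi.du@utsouthwestern.edu
#SBATCH --mem=65536
#SBATCH --time=0-00:00:30
#SBATCH --output=single.%j.out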

Demo 02 : submit multiple tasks to a single node (sequential tasks)

#!/bin/bash
#SBATCH --job-name=multimatlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplejob.%j.out
#SBATCH --error=smultiplejob.%j.time

module add matlab/2013a
time(sh script.m)

(script.m contents shown on the slide.) Running time: ~44.904 s
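The contents of script.m are shown on the slide but were not captured in this transcript. A minimal sketch of what a sequential wrapper could look like (hypothetical; sixteen MATLAB runs back to back, which is roughly consistent with the ~45 s wall time versus ~3 s for a single run):

#!/bin/bash
# hypothetical script.m for Demo 02: run the tasks one after another
for i in $(seq 1 16); do
    matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($i), exit"
done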

Demo 03 : submit multiple tasks to a single node (simultaneous tasks), version 1

#!/bin/bash
#SBATCH --job-name=multicorematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecore.%j.out
#SBATCH --error=smultiplecore.%j.time

module add matlab/2013a
time(sh script.m)

(script.m contents shown on the slide.) Running time: ~6.156 s
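Again, the slide's script.m is not in the transcript. A minimal sketch of the simultaneous version (hypothetical; each MATLAB instance is started in the background with & and the wrapper waits for all of them, which is why the tasks overlap and the wall time drops):

#!/bin/bash
# hypothetical script.m for Demo 03: run the tasks simultaneously
for i in $(seq 1 16); do
    matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($i), exit" &
done
wait    # block until every background task has finished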

Demo 04 : submit multiple jobs to a single node (simultaneous tasks), version 2

#!/bin/bash
#SBATCH --job-name=multi_matlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecoreadv.%j.out
#SBATCH --error=smultiplecoreadv.%j.time

module add matlab/2013a
time(sh script.m)

(script.m contents shown on the slide.) Running time: ~4.904 s

Demo 05 : submit multiple jobs to a single node with srun

#!/bin/bash
#SBATCH --job-name=srunsinglenodematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srunsinglenode.%j.out
#SBATCH --error=srunsinglenode.%j.time

module add matlab/2013a
time(sh script.m)

srun will create a resource allocation to run the parallel job.
SLURM_LOCALID : environment variable; node-local task ID for the process within a job (zero-based).
(script.m contents shown on the slide.) Running time: ~4.596 s
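The Demo 05 script.m is also not reproduced here. A sketch of how srun and SLURM_LOCALID could be combined (hypothetical): with --ntasks=16, srun launches 16 copies of the command and each copy picks its own input index from SLURM_LOCALID (0-15):

#!/bin/bash
# hypothetical script.m for Demo 05: one MATLAB instance per allocated task slot
srun bash -c 'matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($((SLURM_LOCALID+1))), exit"'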

Why srun?
[Diagram: srun contacts slurmctld, which coordinates slurmd/slurmstepd on the compute node to launch tasks 0-15 on Node 0.]
srun will create a resource allocation to run the parallel job.
SLURM_LOCALID : environment variable; node-local task ID for the process within a job (zero-based).

Demo 06 : submit multiple jobs to multiple nodes with srun

#!/bin/bash
#SBATCH --job-name=srun2nodesmatlab
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srun2nodes.%j.out
#SBATCH --error=srun2nodes.%j.time

module add matlab/2013a
time(sh script.m)

SLURM_NODEID : the relative node ID of the current node (zero-based)
SLURM_NNODES : total number of nodes in the job's resource allocation
SLURM_NTASKS : total number of processes in the current job
(script.m contents shown on the slide.) Running time: ~4.741 s
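A quick way to see these variables in action (a sketch, not from the slides) is to let each srun-launched task report where it runs:

srun bash -c 'echo "task $SLURM_PROCID of $SLURM_NTASKS on node $SLURM_NODEID of $SLURM_NNODES ($(hostname))"'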

Demo 06 : submit multiple tasks to multiple nodes with srun (results)
Matrices 1-8 were executed on Nucleus015, matrices 9-16 on Nucleus016. Within each node, task execution is out of order.

Demo 07 : submit a job with a dependency
> sbatch test_srun2nodes.sh
> sbatch --dependency=afterok:{jobid} gen_gif.sh
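A sketch of the same submission with the job ID captured automatically (the afterok type, which starts the second job only if the first one completed successfully, is an assumption; sbatch prints "Submitted batch job {jobid}", which awk picks apart):
> jobid=$(sbatch test_srun2nodes.sh | awk '{print $4}')
> sbatch --dependency=afterok:$jobid gen_gif.sh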

Parallelize your job: at the thread level
- Shared memory: multiple processors operate independently but share the same memory (multi-threading)
- Fast, but lower scalability (cannot cross node boundaries)
Example: POSIX Threads, usually referred to as Pthreads
- Implemented with a <pthread.h> header and a thread library
- For the C programming language
- Thread management: creating threads, joining threads, etc.

Demo 08 : submit a multi-core job to a single node

#!/bin/bash
#SBATCH --job-name=phenix
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=30        # does not necessarily need to be specified here
#SBATCH --time=0-20:00:00
#SBATCH --output=phenix.%j.out
#SBATCH --error=phenix.%j.time

module add phenix/1.9
phenix.den_refine model.pdb data.mtz nproc=30

Think about your job from a scientific point of view: can the problem be solved in parallel?
Check with the software, or read its manual, to see whether it can use multithreading, and try to use appropriate parameters.
If you need help, send email to biohpc-help@utsouthwestern.edu

Parallelize your job: at the node level
- Distributed memory: processors have their own local memory
- Memory is scalable with the number of processors
Q: How does a processor access the memory of another node?
A: Message passing, e.g. the Message Passing Interface (MPI)
- Scales to over 100k processors
- Distributed
- A library of functions (for C, C++, Fortran), not a new language

Demo 09 : submit a multi-node job with message passing

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_test.%j.out

module add mvapich2/gcc/1.9
mpirun ./mpi_only

(Output shown on the slide.)
Rules of thumb:
- total tasks / number of nodes <= 32
- memory needed for each task * tasks on each node <= 64GB/128GB/256GB/384GB (depending on the partition)
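Worked through for the request above: --nodes=2 and --ntasks=8 gives 8 / 2 = 4 tasks per node, well under the 32 task slots per node; on a 64GB node each task could then use at most about 64GB / 4 = 16GB.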

Demo 10 : submit a multi-node job (e.g. MPI + pthreads)

#!/bin/bash
#SBATCH --job-name=mpi_pthread
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_pthread.%j.out

module add mvapich2/gcc/1.9
let "NUM_THREADS=$SLURM_CPUS_ON_NODE/($SLURM_NTASKS/$SLURM_NNODES)"
mpirun ./mpi_pthread $NUM_THREADS

(Output shown on the slide.)
Rules of thumb:
- total MPI tasks * threads per task / number of nodes <= 32
- memory needed for each MPI task * MPI tasks on each node <= 64GB/128GB/256GB/384GB (depending on the partition)
A similar strategy works for MPI + OpenMP code: set OMP_NUM_THREADS=$NUM_THREADS
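Worked through for this request: with --nodes=2, --ntasks=8 and SLURM_CPUS_ON_NODE=32 (2 sockets * 8 cores * 2 threads), NUM_THREADS = 32 / (8 / 2) = 8, i.e. 4 MPI tasks per node with 8 threads each, and 4 * 8 = 32 satisfies the first rule of thumb.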

Demo 11 : submit a GPU job (e.g. CUDA)

#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --time=0-00:10:00
#SBATCH --output=cuda.%j.out
#SBATCH --error=cuda.%j.err

module add cuda65
./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320

Jobs will not be allocated any generic resources unless specifically requested at job submit time, using the --gres option supported by sbatch and srun. Format: --gres=gpu:[n], where n is the number of GPUs. Use the GPU partition.
(The example multiplies A(320, 10240) by B(10240, 320).)

Getting Effective Help
Email the ticket system: biohpc-help@utsouthwestern.edu
- What is the problem? Provide any error messages and diagnostic output you have.
- When did it happen? What time? On the cluster or a client? Which job ID?
- How did you run it? What did you run, with what parameters, and what do they mean?
- Any unusual circumstances? Have you compiled your own software? Do you customize startup scripts?
- Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.