Juropa: Batch Usage Introduction
May 2014
Chrysovalantis Paschoulas (c.paschoulas@fz-juelich.de)

2 Batch System Usage Model

A Batch System:
- monitors and controls the resources on the system
- manages and schedules the jobs
- enforces limits and policies according to the Batch Model
- allocates the resources, sets up the environment and then runs the jobs

Juropa Cluster (3289 compute nodes)
- Juropa JSC Partition: 3033 compute nodes
  - 64 nodes are dedicated to interactive jobs
  - A few nodes are dedicated to software or system tests
  - The rest of the compute nodes are used for normal batch jobs
- 256 compute nodes for a special partition

3 Batch System

On Juropa the Batch System is a combination of Moab and Torque:
- Moab is the Workload Manager (Batch Scheduler)
- Torque is the Resource Manager

The Moab/Torque Batch System:
- Manages policies, priorities and limits
- Starts jobs and manages their output
- Provides features for advanced reservations, backfilling etc.
- Collects job statistics and accounting data
- Provides user commands for job submission, job query, job canceling etc.

4 Batch System Configuration & Limits

To implement and manage the scheduling, the Batch System uses the abstraction of queues (classes), such as jsc, inter_jsc etc.
Class configuration: allowed users, list of nodes, max. limits, default values etc.

Interactive jobs
- 64 compute nodes available
- No node sharing
- Max. number of nodes: 16 (default: 1)
- Max. wall-clock time: 10h (default: 30 min)
- Max. running jobs: 20 per user (including batch jobs)
- Accounting: (number of nodes) x (connect-time)

5 Batch System Configuration & Limits

Batch jobs
- 3033 compute nodes available
- No node sharing
- Max. number of nodes per job: 1024
- Max. wall-clock time: 24h
- Max. number of running jobs: 20 per user
- Max. number of eligible jobs: 20 per user
- Default values:
  - Number of nodes: 1
  - Walltime limit: 30 min
  - Tasks per node: 8
- Accounting: (number of nodes) x (wall-clock time)

Jobs requesting more nodes than the limits allow can be run on special request:
- They are not included in normal scheduling
- They will be run e.g. once per week, if needed, during non-prime time
- Please contact sc@fz-juelich.de
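The accounting rule above, (number of nodes) x (wall-clock time), can be sketched as a small shell helper. This is only an illustration: the function name node_hours is hypothetical, and reporting the result in node-hours is an assumption, since the slides do not name the charging unit.

```shell
# Hypothetical helper: charge = nodes x wall-clock time.
# Assumption: the result is expressed in node-hours (unit not given in the slides).
node_hours() {
    # $1 = number of nodes, $2 = wall-clock time as hh:mm:ss
    echo "$2" | awk -F: -v n="$1" \
        '{ printf "%.2f\n", n * ($1 * 3600 + $2 * 60 + $3) / 3600 }'
}

# 16 nodes for 4 hours
node_hours 16 04:00:00
```

Because nodes are never shared, the whole node is charged for the full wall-clock time of the job, regardless of how many of its cores are actually used.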

6 Compiling Programs

To compile and execute parallel programs on Juropa, MPI wrappers for the Intel compilers are available. These wrappers set up the MPI environment for the compilation task.
The current wrappers are: mpicc, mpicxx, mpif77, mpif90

Some useful compiler options:
- -openmp: enables OpenMP
- -g: creates debugging information
- -L: path to libraries for the linker
- -O[0-3]: optimization levels

The wrappers (like many other tools) are provided as modules, so users can easily choose the compiler version they want:
    module load parastation/mpi2 intel

Compile example:
    mpicxx -O2 program.cpp -o mpi_program

Execute MPI program:
    mpiexec <options> mpi_program

7 Job Submission

After compilation, users can submit a job with the command:
    msub <options> <executable> or <jobscript>

For example:
    msub -l nodes=16:ppn=8 -l walltime=04:00:00 -e /lustre/jhome/zam/err -o /lustre/jhome5/zam/out myscript

Useful options of msub:
    -l nodes=<num>:ppn=<num>   # number of nodes and procs per node
    -l walltime=<hh:mm:ss>     # requested wall-clock time
    -j oe                      # combine stderr and stdout
    -M <email address>         # send email to this address
    -m eab                     # send email on end, abort or begin
    -N <name>                  # name of job
    -I                         # run an interactive job
    -v tpt=<num threads>       # number of OpenMP threads
    -q <queue name>            # destination queue (class)

Instead of passing all these options to msub on a single command line, users can include the job submission options in the batch script.
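Before calling msub it can be useful to check a request against the batch limits quoted earlier in this talk (at most 1024 nodes per job and 24h wall-clock time). The following is a minimal sketch; check_request is a hypothetical helper, not part of Moab or Torque, and the limits are simply copied from the slides.

```shell
# Hypothetical pre-submission check, not a Moab/Torque command.
# Limits taken from the batch-job slide: 1024 nodes, 24h wall-clock time.
check_request() {
    # $1 = number of nodes, $2 = walltime as hh:mm:ss
    secs=$(echo "$2" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    if [ "$1" -gt 1024 ]; then
        echo "too many nodes"
        return 1
    fi
    if [ "$secs" -gt 86400 ]; then
        echo "walltime too long"
        return 1
    fi
    echo "ok"
}

check_request 16 04:00:00
```

A request that passes this check can still be rejected or delayed by the scheduler, for example because of the per-user job limits or an exhausted CPU quota.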

8 Batch Script

To define the msub options in a batch script, users have to use #MSUB directives. For parallel jobs the script must contain the MPI execution command mpiexec.

Example 1
This MPI application starts 64 MPI tasks on 8 nodes using 8 cores per node. Each MPI task runs on a single core.

    #!/bin/bash -x
    #MSUB -l nodes=8:ppn=8
    #MSUB -l walltime=04:00:00
    #MSUB -e /home/jhome3/test_user/my-error.txt
    #MSUB -o /home/jhome3/test_user/my-out.txt
    ### start of jobscript
    cd $PBS_O_WORKDIR
    echo "workdir: $PBS_O_WORKDIR"
    # NSLOTS = nodes * ppn = 8 * 8 = 64
    NSLOTS=64
    echo "running on $NSLOTS cpus"
    mpiexec -np $NSLOTS ./mpi_program
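Example 1 hard-codes NSLOTS=64. A minimal sketch of deriving the value instead is shown below. It assumes that Torque provides the file named by $PBS_NODEFILE with one line per allocated slot, which is standard Torque behaviour but is not confirmed for Juropa by this talk; count_slots is a hypothetical helper.

```shell
# Hypothetical helper: count the slots listed in a Torque-style node file.
count_slots() {
    # $1 = path to a node file with one line per allocated slot
    wc -l < "$1" | tr -d ' '
}

# Inside a job script this could be used as (assumption, see above):
#   NSLOTS=$(count_slots "$PBS_NODEFILE")
#   mpiexec -np "$NSLOTS" ./mpi_program
```

Deriving the count keeps the script consistent when the nodes or ppn values in the #MSUB directives change.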

9 Batch Script

Example 2
The application is started on 32 nodes with 1 MPI task per node, 32 tasks in total, so only one core is used per node.

    #!/bin/bash -x
    #MSUB -N MPI_32x1_job
    #MSUB -l nodes=32:ppn=1
    ### start of jobscript ###
    mpiexec -np 32 mpi_program >> $PBS_O_WORKDIR/out.$PBS_JOBID

Example 3
A hybrid job that uses MPI and OpenMP. The application is started on 4 nodes with 1 MPI task per node, 4 tasks in total, and each task has 8 OpenMP threads.

    #!/bin/bash -x
    #MSUB -N hybrid_8x8_job
    #MSUB -l nodes=4:ppn=8
    #MSUB -v tpt=8
    ### start of jobscript ###
    export OMP_NUM_THREADS=8
    mpiexec -np 4 --exports=OMP_NUM_THREADS mpi_omp_program

10 Task Allocation & SMT

Task Allocation
To calculate the total number of MPI tasks for a job:
    total number of MPI tasks = (number of nodes) x (procs per node)
For hybrid MPI/OpenMP jobs (option -v tpt is used):
    total number of MPI tasks = (number of nodes) x ((procs per node) / (threads per task))
To calculate the number of MPI tasks per node for hybrid jobs:
    MPI tasks per node = (procs per node) / (threads per task)

SMT
The compute nodes on Juropa support SMT (Intel Xeon CPUs, Nehalem architecture). To use SMT for a job, users have to set the msub option ppn=16. For example:

    #!/bin/bash -x
    #MSUB -N SMT_hybrid_8x8_job
    #MSUB -l nodes=4:ppn=16
    #MSUB -v tpt=8
    ### start of jobscript ###
    export OMP_NUM_THREADS=8
    mpiexec -np 8 --exports=OMP_NUM_THREADS application.exe
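The task-allocation formulas can be checked with a few lines of shell arithmetic. This is a sketch only: the function names are hypothetical, and a pure MPI job is modelled by passing 1 as the threads-per-task value.

```shell
# Hypothetical helpers implementing the formulas from the slide.
mpi_tasks_per_node() {
    # $1 = procs per node (ppn), $2 = threads per task (tpt)
    echo $(( $1 / $2 ))
}

total_mpi_tasks() {
    # $1 = number of nodes, $2 = ppn, $3 = tpt (use 1 for pure MPI jobs)
    echo $(( $1 * ($2 / $3) ))
}

total_mpi_tasks 8 8 1     # pure MPI job: nodes=8:ppn=8
total_mpi_tasks 4 8 8     # hybrid job: nodes=4:ppn=8, tpt=8
total_mpi_tasks 4 16 8    # hybrid job with SMT: nodes=4:ppn=16, tpt=8
```

The results are the values that the job scripts in this talk pass to mpiexec -np: 64, 4 and 8 respectively.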

11 More Options

Interactive Jobs
To start an interactive job on Juropa, users have to submit a job with the -I option of msub. If the resources are free, the user automatically gets access to the nodes and can start the application.

    msub -I -l nodes=2:ppn=8,walltime=00:15:00

Job dependencies & Job chains
Users can submit a job that depends on another job:

    msub -W depend=<jobid> <jobscript>

or even submit job chains:

    #!/bin/bash
    NO_OF_JOBS=<no of jobs>
    JOB_SCRIPT=<jobscript>
    # Submit the first job and extract its ID from the msub output
    JOBID=$(msub $JOB_SCRIPT 2>&1 | grep -v -e '^$' | sed -e 's/\s*//')
    # The first job counts as job 1; each further job waits for its predecessor
    i=1
    while [ $i -lt $NO_OF_JOBS ]; do
        JOBID=$(msub -W depend=afterok:$JOBID $JOB_SCRIPT 2>&1 | grep -v -e '^$' | sed -e 's/\s*//')
        let i=$i+1
    done
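The job-chain script relies on a small filter to turn the msub output into a bare job ID. The sketch below isolates that step so it can be tried without a batch system. The sample input is an assumption (a blank line followed by an indented job ID, which is what the grep/sed filter in the script implies), and extract_jobid is a hypothetical name.

```shell
# Hypothetical helper: reduce msub output to the bare job ID.
extract_jobid() {
    # Drop empty lines, then strip leading whitespace from what remains
    grep -v -e '^$' | sed -e 's/^[[:space:]]*//'
}

# Simulated msub output (assumption: blank line, then the indented job ID)
printf '\n  123456\n' | extract_jobid
```

The POSIX character class [[:space:]] is used here instead of \s so that the filter does not depend on GNU sed.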

12 Batch System - Commands

msub
    Submit a job. Returns the job ID on success.
    Note: during times of heavy load Moab might run into a timeout.
showq [-r -i -b] [-u <userid>]
    Shows all, running, idle or blocked jobs of all or specified users.
mjobctl -c <jobid>
    Cancel a queued or running job.
mjobctl -c -w USER=<userID>
    Cancel all jobs of the specified user.
checkjob [-v] <jobid>
    Display detailed information on a specified job.

13 Batch System - Commands

showstart <jobid>
    Shows the estimated start time of the specified job. The estimate can change when jobs with higher priority get scheduled.
mjobctl --help
    Shows all options, e.g. how to hold jobs or release holds on jobs.
showbf -c jsc
    Shows the resources available for immediate use.

For more detailed information on Moab commands please see:

llview
    Graphical view of usage, jobs, distribution of jobs, etc.
    Developed by W. Frings, member of JSC.

14 Batch System Job Scheduling

Some comments on job scheduling:
- Jobs are scheduled by priority
  - Priority increases with the number of nodes
  - Priority increases while waiting (aging)
- No node sharing
- Backfilling mode: please specify the requested wall-time as exactly as possible
- Jobs of users (groups) who have run out of CPU quota get very low priority
- Please be aware that estimated start times change very often

15 Job life-cycle

Components: Login Nodes; Juropa Master Node (Moab Server, Torque Server); Compute Nodes (Mother Superior, Compute node 02 ... Compute node N).

1. Users develop their software.
2. Compile with the MPI wrappers.
3. Create the batch script.
4. Submit the jobs with msub.
5. msub uses the submission filter to put the jobs into the proper queue, and the job goes into the Moab queue.
6. When there are enough free resources, Moab converts the job script into PBS syntax and tells the RM to start the job on the set of associated nodes (calls qsub).
7. Torque starts the prologue on the compute nodes associated with the job. When all the required resources are available AND the nodes are in a healthy condition, it starts the jobscript on the Mother Superior.
8. The Mother Superior runs the jobscript and executes the mpiexec command, which communicates via the psid daemons to start the MPI tasks on all compute nodes. When the job is completed, the RM daemons run the epilogue on all nodes to clean up resources.

16 Batch System CPU quota & Accounting

Query the current status of the CPU quota:
    q_cpuquota <options>
    q_cpuquota -?    # shows all available options

Types of CPU quota
- Fixed: a fixed amount of CPU quota can be used during the allocation period (applies to small quota amounts).
- Monthly: jobs are scheduled with normal priority until the quota of the current, previous and next month is exhausted. CPU quota not used in this time frame is lost.

Charging mode
- Users are charged for the wall-clock time of their jobs on the set of nodes
- 3 states of contingent: normal, low-cont, no-cont (monthly quotas)
- CPU quotas are defined per group
- All members of a group are informed by mail if the group runs out of CPU quota or if new quota is assigned
- Jobs get low priority (<0) and a reduced wall-time limit (6 hours) when the CPU quota is used up. They will only run if no other waiting jobs fit into the free resources

17 Further Information

- Regular preventive maintenance every second Thursday
- See the "Message of today" at login
- Get recent status updates by subscribing to the system high-messages as described at the bottom of this page: http://juelich.de/jsc/compserv/services/high_msg.
- Juropa on-line documentation
- User support at FZJ
  Phone:
