NYUAD HPC Center Running Jobs
1 Running your job

Submitting and running jobs on any cluster involves these three major steps:

1. Set up and launch the job.
2. Monitor its progress.
3. Retrieve and analyze the results.

Before you start, please read the document for information and policies on your data allocation. Of all the available space, only /scratch should be used for computational purposes.

NYUAD HPC Services strongly recommends that you use restart and checkpointing techniques to safeguard your computational results in the event of a crash or outage. It is the responsibility of the researcher to develop these techniques, and the larger your jobs are and the longer they are scheduled to run, the more important this becomes. At a minimum, be sure to build restart support into your source code. A restart file enables you to restart the job from regular intervals. The main purpose is to divide a large job into sections, so that each section runs within the scheduled time and, if there is an unplanned outage, the entire job is not lost.
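As an illustrative sketch only (not an NYUAD-specific recipe), one way to implement this is to test for a checkpoint file at the start of the job script and pass a restart option to your program when one is found. The program name (./mycode), the checkpoint file name (checkpoint.dat) and the --restart flag below are placeholders for whatever restart mechanism your own code provides:

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=5:00:00
#PBS -N restartable_job

cd /scratch/netid/jobdirectory/
# Resume from the last checkpoint if one exists, otherwise start from the beginning.
# ./mycode, checkpoint.dat and --restart are hypothetical placeholders.
if [ -f checkpoint.dat ]; then
  ./mycode --restart checkpoint.dat &> output
else
  ./mycode &> output
fi
exit 0;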
The following sections describe how to submit and run your jobs.

2 PBS Scripts

You will need to use PBS (Portable Batch System) scripts to set up and launch jobs on any cluster. While it is possible to submit batch requests using an elaborate command line invocation, it is much easier to use PBS scripts, which are more transparent and can be reused for sets of slightly different jobs.

A PBS script performs two key jobs:

1. It tells the scheduler about your job, such as: the name of the program executable; how many CPUs you need and how long the job should run; and what to do if something goes wrong.
2. The scheduler 'runs' your script when it comes time to launch your job.

A typical PBS script looks like this:

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=5:00:00
#PBS -N jobname
#PBS -e localhost:/scratch/netid/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:/scratch/netid/${PBS_JOBNAME}.o${PBS_JOBID}

cd /scratch/netid/jobdirectory/
./command &> output
exit 0;

The first "#PBS -l" line tells the scheduler to use one node with one processor per node (1 CPU in total), and that the job will be aborted if it has not completed within 5 hours. Put your job's name after "#PBS -N". If you would like to receive emails regarding this job, you may put your email address after "#PBS -M"; adding "#PBS -m abe" asks the system to email you when the job Aborts, Begins, and Ends. Two kinds of files, error files and output files, are usually generated when the job is executed. The paths where these files are stored are controlled by the "#PBS -e" and "#PBS -o" directives; change these paths as needed.
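For example, to be emailed at a placeholder address (substitute your own) when the job aborts (a), begins (b), or ends (e), the two email-related directives would read:

#PBS -M [email protected]
#PBS -m abe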
Every line that starts with "#PBS" passes a directive to the scheduler, while inserting a white space after the "#" turns it into an ordinary comment. For example, of the two lines below, the first sets a PBS walltime, whereas the second is ignored.

#PBS -l walltime=4:0:0
# PBS -l walltime=4:0:0

After setting up all the parameters as above, you tell the scheduler how to execute your job by listing the commands to run. You may also set environment variables right before these commands.

Estimating Resources Requested

Estimating walltime as accurately as possible helps MOAB/Torque schedule your job more efficiently. If your job only needs a few hours to finish, do not ask for a much longer walltime. Please review the available queues and queue parameters offered by NYUAD HPC.

Estimating the number of nodes and the number of CPU cores is equally important. Requesting more nodes or more CPU cores than the job needs removes those resources from the available pool. Serial jobs should use one CPU core (ppn=1) unless there are higher than usual memory requirements. If a higher memory requirement is essential, send an email to [email protected] to arrive at the best core-to-memory distribution for a given job. There are occasions when using an entire compute node for a single process is required, in which case all 12 CPU cores should be requested for the serial job (ppn=12).

Invoking Interactive Sessions

Use the following PBS command to initiate an interactive session on a compute node of the cluster:

$ qsub -I -q interactive -l nodes=1:ppn=12,walltime=04:00:00

Use the following PBS command to initiate an interactive session with an X session to a compute node of the cluster:

$ qsub -I -X -q interactive -l nodes=1:ppn=12,walltime=04:00:00

Submitting Serial Jobs

A serial job on NYUAD HPC is defined as a job that requires no more than one node and does not involve any inter-node data communication.

Submitting Single-Core Jobs

A serial job usually takes one CPU core on a node. We specify this in the "#PBS -l" line. The PBS script should look like this:

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=5:00:00
#PBS -N jobname
#PBS -e localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.o${PBS_JOBID}

cd /scratch/netid/jobdirectory/
./serialtest &> output
exit 0;

We then save this script in a text file, say job.pbs, and submit the job by running:

$ qsub job.pbs

Submitting OpenMP Serial Jobs

Although OpenMP (not OpenMPI) jobs can use more than one CPU core, all such cores are within a single node. OpenMP jobs are therefore serial jobs and cannot be submitted to Bowery. To submit an OpenMP job to 1 node and 12 CPU cores:

#!/bin/bash
#PBS -l nodes=1:ppn=12,walltime=5:00:00
#PBS -N jobname
#PBS -e localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.o${PBS_JOBID}

cd /scratch/netid/jobdirectory/
export OMP_NUM_THREADS=12
./omptest &> output
exit 0;
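As an aside (a sketch, not part of the original example): rather than hard-coding the thread count, you can derive it from the allocation that Torque records in $PBS_NODEFILE, which contains one line per allocated core. The executable name ./omptest is the same placeholder used above.

# Set the OpenMP thread count from the PBS allocation instead of hard-coding 12.
# $PBS_NODEFILE lists one line per core allocated by Torque on this job.
export OMP_NUM_THREADS=$(wc -l < $PBS_NODEFILE)
./omptest &> output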
Submitting Parallel Jobs

Parallel jobs use more than one node and usually involve cross-node message/data communication. MPI is widely used for parallel jobs, and MPI wrappers are available on all the NYU HPC clusters. However, it is highly recommended to launch parallel jobs on the Bowery cluster. You are also encouraged to submit p12 jobs (jobs with a walltime of 12 hours or less), as there are many more p12 nodes available. Additionally, 96 nodes from chassis 4 to 9 on Bowery are all 12-hour nodes and have 12 CPUs per node. By declaring ppn=12, you can make sure your jobs go to these often less busy compute nodes. Using all 12 CPU cores on each node also avoids wasting resources.

Submitting MPI Parallel Jobs

To submit an MPI job to 2 nodes and 24 CPU cores:

#!/bin/bash
#PBS -l nodes=2:ppn=12,walltime=5:00:00
#PBS -N jobname
#PBS -e localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.o${PBS_JOBID}

cd /scratch/netid/jobdirectory/
/share/apps/mpiexec/0.84/gnu/bin/mpiexec -comm ib -np 24 ./mvatest &> output
exit 0;

Submitting MPI Jobs with Fewer CPU Cores

It is also possible to claim fewer CPU cores than a node actually has for an MPI job. To submit a serial MPI job to 1 node and 4 CPU cores:

#!/bin/bash
#PBS -l nodes=1:ppn=4,walltime=5:00:00
#PBS -N jobname
#PBS -e localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.o${PBS_JOBID}

cd /scratch/netid/jobdirectory/
/share/apps/mpiexec/0.84/gnu/bin/mpiexec -np 4 env VIADEV_ENABLE_AFFINITY=0 ./mpitest &> output
exit 0;

You must specify "env VIADEV_ENABLE_AFFINITY=0" in your script for serial MPI jobs because otherwise MPI tends to bind your processes to a particular group of CPUs (the first four CPUs in a node, for example). If you or someone else then submits another serial MPI job to the same node, that job may also be bound to the same CPUs, and both calculations will be slowed down.

Submitting bigmem jobs on BuTinah

The bigmem queue on BuTinah has been created for jobs with memory requirements of more than 48 GB. If the memory usage is less than 48 GB, please use the other compute nodes.

#!/bin/sh
#PBS -V
#PBS -N PBS_JOB_NAME
#PBS -l nodes=1:ppn=12,walltime=12:00:00
#PBS -e localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.e${PBS_JOBID}
#PBS -o localhost:$PBS_O_WORKDIR/${PBS_JOBNAME}.o${PBS_JOBID}
#PBS -q bigmem
#PBS -l mem=64gb

Submit many similar runs (serial & parallel) with mpiexec

Because of the overhead the scheduler incurs in processing each submitted job, which is particularly serious when the jobs are small and/or short in duration, it is generally not a good idea simply to run a loop over qsub and inject a large set of small jobs into the queue. It is often far more efficient to package such small jobs into larger 'super-jobs', provided that each small job in the larger job is expected to finish at about the same time (so that cores are not left allocated but idle).
Assuming this condition is met, what follows is a recommended method of aggregating a number of smaller jobs into a single Torque/PBS job that can be scheduled as a unit. Note that, in principle, there is another approach to this problem using a Torque/PBS feature called job arrays, but that feature still incurs significant scheduling overhead (because an array of N jobs is still handled internally as N ordinary jobs), in contrast to the method described here.

For simplicity, this example assumes the small jobs are serial (one-core) jobs. First, group the small jobs into sets of similar runtime (for queues p12 and p48, choose the largest multiple of 12 jobs that will end close together), and package each set of N separate similar-runtime jobs as a single N-core job as follows. The PBS directives at the top of the submission script should specify, for queues p48 and p12:

#PBS -l nodes=<N/12>:ppn=12

where the <> above should be replaced by the result of the trivial calculation enclosed. Then, instead of the usual executable line

$ ./$executable $arguments

at the end of the submission script to launch one serial job, launch N jobs via something like the following. Note that this makes use of the Ohio SC version of mpiexec; it will not work with OpenMPI.

Launch Commands:

source /etc/profile.d/env-modules.sh
module load mpiexec/gnu/0.84
cd directory_for_job1
mpiexec -comm none -n 1 $executable arguments_for_job1 > output 2> error &
cd directory_for_job2
mpiexec -comm none -n 1 $executable arguments_for_job2 > output 2> error &
...
cd directory_for_jobN
mpiexec -comm none -n 1 $executable arguments_for_jobN > output 2> error &
wait
In the above, note:

1. the use of mpiexec (not mpirun) with the options -comm none -n 1, which mean that in this case the application is not using MPI and just needs one core (being serial). We are simply using the job-launch functionality of mpiexec in this example, but we could alter the arguments to launch parallel MPI 'small' jobs instead of serial ones (in the case of MVAPICH2, add the -comm pmi option to select the correct parallel launch protocol);
2. the redirections > output 2> error, which send stdout and stderr to files called output and error respectively in each job's directory (you can of course change these names, and even have the jobs run in the same directory if they are really independent; see the example below);
3. the & at the end of each mpiexec line, which allows them to run simultaneously (the mpiexecs cooperate and take different cores out of the set allocated by Torque/PBS);
4. the wait command at the end, which prevents the job script from finishing before the mpiexecs do.

3 Many Serial Tasks in One Batch Job with PBSDSH

Often it is necessary to run tasks in parallel without using MPI. Torque provides a tool called pbsdsh to facilitate this; it makes the best use of available resources for embarrassingly parallel tasks that do not require MPI. pbsdsh is run from your submission script and spawns a script or application across all the CPUs and nodes allocated by the batch server. This means that the script must be smart enough to figure out its role within the group. Here is an example of a job on 24 cores.

PBS Script:

#!/bin/sh
#PBS -l nodes=2:ppn=12
pbsdsh $PBS_O_WORKDIR/myscript.sh

Since the same shell script myscript.sh is executed on each core, the script needs to be clever enough to decide what its role is. Unless all processes are meant to do the same thing, we have to distinguish the cores or processes. The environment variable PBS_VNODENUM helps here: when n cores are requested, it takes a value from 0 to n-1 and thus numbers the requested cores. You can use PBS_VNODENUM to pass it to the same program as an argument, to start a different program in each process, or to read different input files. The following three examples show the shell script myscript.sh for these three cases.

Example: Submit PBS_VNODENUM as Argument
$ cat myscript.sh
#!/bin/sh
cd $PBS_O_WORKDIR
PATH=$PBS_O_PATH
./myprogram $PBS_VNODENUM

Setting the current directory and the environment variable PATH is necessary since only a very basic environment is defined by default.

Example: Start Different Programs

$ cat myscript.sh
#!/bin/sh
cd $PBS_O_WORKDIR
PATH=$PBS_O_PATH
./myprogram.$PBS_VNODENUM

Example: Read Different Input Files

$ cat myscript.sh
#!/bin/sh
cd $PBS_O_WORKDIR
PATH=$PBS_O_PATH
./myprogram < mydata.$PBS_VNODENUM

Link to PBSDSH examples

Monitoring Your Job

You can monitor your job's progress while it is running. There are various ways to do this. One way is to use this PBS command:

$ showq

You will see all current jobs on the cluster. To see only the lines relevant to your jobs, you can use this command:

$ showq | grep NetID

You should see a line like this:

$ showq | grep NetID
NetID Running 16 23:26:49 Wed Feb 13 14:38:33
The above result indicates the job number, the owner name, the job status, the number of CPUs, the time remaining to run, and the date and time the job was submitted. You can also see the same information by using the following command:

$ showq -u NetID

or simply by typing "myq":

$ myq

If the cluster is busy, your job may have to wait in the queue, in which case the status of the job will be Idle. If you are interested in the current cluster usage, you may run:

$ pbstop

Each column represents a node; a node is busy if its column is filled with letter blocks. You will typically need to wait longer before your jobs are executed if fewer nodes are available. To see where your jobs are running, type:

$ pbstop -u NetID

Be sure to substitute your own NetID for "NetID".

Deleting a Job

If you want to stop your job before it has finished running, you can do so using the qdel command:

$ qdel jobid

To stop/delete all of your jobs:

$ qdel all

"qdel all" deletes all the jobs owned by a user, irrespective of the state of each job.

4 Questions?

Please read the FAQs on our website first. If you have more questions about running jobs, please send an email to [email protected].