Grid Computing Competence Center Introduction to the SGE/OGS batch-queuing system Riccardo Murri Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich Oct. 6, 2011
The basic problem
Process a large set of data. Assumptions:
1. The processing cannot be done on a single computer, because of space or time constraints.
2. The data can be subdivided into files, each of which can be processed independently.
3. (Processing each file can comprise several steps.)
4. (Accessing the files over a network has acceptable overhead.)
Today's lab session
Two approaches:
- Local execution of programs (e.g., on your laptop)
- Batched execution of programs (on a cluster)
The goal of these initial lab sessions is to show what the difference is, in practice, and what tools are available in each case.
These slides are available for download from: http://www.gc3.uzh.ch/teaching/lsci2011/lab02/lab02.pdf
Login to the cluster ocikbpra.uzh.ch
Log in to the cluster:
    ssh username@ocikbpra.uzh.ch
You should be greeted by this shell prompt:
    [username@ocikbpra ~]$
Gather the sample application and test files into a directory lab2:
    mkdir lab2
    cp -av ~murri/lsci/rank-int.i386 lab2/
    cp -av ~murri/lsci/M0,6*.sms lab2/
    cd lab2
The cluster ocikbpra.uzh.ch
[Figure: cluster layout. Users connect from the internet with ssh username@ocikbpra.uzh.ch. The node ocikbpra.uzh.ch is shown with the /home and /share/apps filesystems (exported over the net). A local 1Gb/s ethernet network connects it to the compute nodes compute-0-0.local, compute-0-1.local, ..., compute-0-27.local, each with a local scratch filesystem in /state/partition1.]
Recap from Lab Session 1 Process control features offered by the GNU/Linux shell: background processes with the & operator monitor process status with the ps command send signals to running processes with the kill command Lab Session 1 slides are available for download from: http://www.gc3.uzh.ch/teaching/lsci2011/lab01/lab01.pdf
Timing command execution, I
The command /usr/bin/time reports the time the system spent executing a command. Typical reports include:
- user time: CPU time spent processing user-level code.
- system time: CPU time spent processing kernel-level code.
- real/elapsed time: time from the start to the end of the program (as would have been measured by an external clock).
Quiz: can the CPU time be greater than the real/elapsed time?
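One way to explore the quiz experimentally is sketched below. It assumes bash: the shell's own time keyword and its TIMEFORMAT variable stand in for /usr/bin/time here, and burn is a hypothetical helper written for this example, not part of the lab material. Two CPU-bound workers run at the same time, so their CPU times add up while the elapsed time is shared.

```shell
#!/bin/bash
# burn N: keep one CPU busy by counting up to N (hypothetical helper).
burn() {
    local i=0
    while [ "$i" -lt "$1" ]; do i=$((i + 1)); done
}

# Report "real user sys" in seconds on a single line.
TIMEFORMAT='%R %U %S'

# Start two workers in the background and wait for both;
# the report of the time keyword goes to stderr, so capture that.
report=$( { time { burn 100000 & burn 100000 & wait; } ; } 2>&1 )
read -r real user sys <<< "$report"

echo "real=${real}s user=${user}s sys=${sys}s"
```

Compare the user and real figures on a machine with several cores, and again on a machine (or under a limit) where only one core is available.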
Timing command execution, II
Exercises:
1. Using man time, figure out how to determine the CPU and real time spent running the command ./rank-int.i386 M0,6-D5.sms.
2. Can time also report on memory usage? If yes, how much memory does the above command take?
Timing command execution, III
Command line to run:
    $ /usr/bin/time ./rank-int.i386 M0,6-D5.sms
Command output:
    ./rank-int.i386 file:m0,6-d5.sms rows:3024 cols:49800...
Timing information:
    0.10user 0.04system 0:00.18elapsed 80%CPU
Memory information:
    (0avgtext+0avgdata 0maxresident)k
I/O and paging info:
    0inputs+0outputs (0major+1971minor)pagefaults 0swaps
Resource limits, I Why impose limits on the utilization of system resources? What system resources would you want to limit in our case?
Resource limits, II
The command ulimit allows setting resource usage limits:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
[...]
file size               (blocks, -f) unlimited
[...]
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
[...]
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32767
virtual memory          (kbytes, -v) unlimited
[...]
Resource limits, III
Warning: The ulimit command is a shell built-in. It takes effect immediately and applies to all subsequent commands run from the same shell.
To restrict the scope to one command only, enclose that command and ulimit in parentheses:
    $ (ulimit -t 15; ./rank-int.i386 M0,6-D8.sms)
(Parentheses force the enclosed commands to be executed in a sub-shell.)
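The scoping rule can be checked directly. The following is a minimal sketch that uses only the ulimit built-in; the variable names are illustrative, and it assumes the current CPU-time limit is either unlimited or above 15 seconds.

```shell
# CPU-time limit of the current shell, before any change.
before=$(ulimit -t)

# Inside the parentheses, the new limit is set and read back in a sub-shell.
inside=$( (ulimit -t 15; ulimit -t) )

# Back in the original shell, the limit has not changed.
after=$(ulimit -t)

echo "before=$before inside=$inside after=$after"
```

Running ulimit -t 15 without the parentheses would instead lower the limit for the rest of the session.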
Resource limits, IV
Exercises:
1. What does the following command do?
    $ (ulimit -t 15; ./rank-int.i386 M0,6-D8.sms)
   What happens if you leave out the ulimit part?
2. What are the options given by ulimit for limiting memory?
3. What should happen if you run the following command? What really happens?
    $ (ulimit -m 102400; ./rank-int.i386 M0,6-D11.sms)
4. What should happen if you run the following command? What really happens?
    $ (ulimit -v 102400; ./rank-int.i386 M0,6-D11.sms)
SGE/OGS
Sun Grid Engine (SGE) is a batch-queuing system produced by Sun Microsystems; it was made open-source in 2001.
After the acquisition of Sun by Oracle, the product forked:
- Open Grid Scheduler (OGS) is the open-source version.
- Univa Grid Engine is a commercial-only version, developed by the core SGE engineering team from Sun.
Used on Schroedinger, the main UZH HPC cluster.
SGE architecture, I
sge_qmaster
- Runs on master node ocikbpra.uzh.ch
- Accepts client requests (job submission, job/host state inspection)
- Schedules jobs on compute nodes (formerly done by a separate sge_schedd process)
Client programs qhost, qsub, qstat
- Run by the user on a submit node
- Clients for sge_qmaster
- The master daemon has a list of authorized submit nodes
SGE architecture, II
sge_execd
- Runs on every compute node
- Accepts job start requests from sge_qmaster
- Monitors node status (load average, free memory, etc.) and reports back to sge_qmaster
sge_shepherd
- Spawned by sge_execd when starting a job
- Monitors the execution of a single job
Job submission, I
The qsub command is used to submit a job to the batch system. The job consists of a shell script and its (optional) arguments.
Example:
    qsub myscript.sh
If any arguments are given after the script name, they will be available to the script as $1, $2, etc.
    # in myscript.sh, $1="hello" and $2="world"
    qsub myscript.sh hello world
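To see the argument passing at work before involving the batch system, a job script can be tried locally first. The sketch below writes an illustrative myscript.sh (its contents are an assumption made for this example, and it overwrites any file of that name in the current directory) and runs it with the same two arguments.

```shell
# Create a minimal job script; the contents are illustrative only.
cat > myscript.sh <<'EOF'
#!/bin/sh
# Positional parameters hold the arguments given after the script name.
echo "first argument: $1"
echo "second argument: $2"
EOF

# Local dry run: the arguments reach the script the same way
# as with "qsub myscript.sh hello world".
sh myscript.sh hello world
```

When the same script is submitted with qsub, these two lines end up in the job's standard output file instead of on the terminal.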
Job submission, II
Upon successful submission, qsub prints a job ID to standard output:
    $ qsub -cwd myscript.sh
    Your job 76104 ("myscript.sh") has been submitted
This job ID must be used with all SGE commands that operate on jobs.
As soon as the job starts, two files will be created, containing the script's standard output (suffix .o followed by the job ID) and standard error (suffix .e followed by the job ID).
    $ ls -l myscript.sh*
    -rwxrwxr-x 1 murri murri 30 Oct 6 14:23 myscript.sh
    -rw-r--r-- 1 murri murri  0 Oct 6 14:24 myscript.sh.e76104
    -rw-r--r-- 1 murri murri 14 Oct 6 14:24 myscript.sh.o76104
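Scripts that submit jobs usually need the job ID as a plain number. A minimal sketch, assuming the message has exactly the format shown above; extract_job_id is a hypothetical helper, and the sample message stands in for real qsub output.

```shell
# Sample message, copied from the qsub output shown above.
msg='Your job 76104 ("myscript.sh") has been submitted'

# extract_job_id MESSAGE: print the third word of the message (hypothetical helper).
extract_job_id() {
    echo "$1" | awk '{ print $3 }'
}

jobid=$(extract_job_id "$msg")

# The output file names follow from the script name and the job ID.
echo "stdout file: myscript.sh.o$jobid"
echo "stderr file: myscript.sh.e$jobid"
```

On the cluster, msg would be set from the actual submission, for example msg=$(qsub -cwd myscript.sh).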
Commonly used options for qsub
-cwd  Execute the job in the current directory; if not given, the job script is run in the home directory.
-o    Path name of the file where standard output will be stored.
-e    Path name of the file where standard error will be stored.
-j    If -j y is given, then merge standard error into standard output (as if they were both sent to the screen).
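These options can be combined on one command line. The sketch below assembles such a command; the file names are placeholders chosen for this example, and the submission is only attempted where a qsub client is actually installed.

```shell
# Placeholder names, chosen for this example.
script=myscript.sh
logfile=myscript.log

# -cwd: run in the current directory; -j y: merge stderr into stdout;
# -o: store the merged output in $logfile.
submit_cmd="qsub -cwd -j y -o $logfile $script"

if command -v qsub >/dev/null 2>&1; then
    $submit_cmd
else
    # No SGE client on this machine: only show what would be submitted.
    echo "would run: $submit_cmd"
fi
```

With -j y there is no separate standard error file, so -e is not needed here.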
Monitoring jobs
The qstat command is used to monitor jobs submitted to the SGE system. Example:
$ qstat
job-id prior   name       user      state submit/start at     queue
73344  0.60500 mod_run    danielyli dt    10/06/2011 14:38:45 all.q@compute-0-13.local
76105  0.50500 myscript.s murri     r     10/06/2011 14:40:35 all.q@compute-0-20.local
The state column is a combination of the following codes (see man qstat for a complete list):
r   Job is running
qw  Job is waiting in the queue
qh  Job is being held back in the queue
E   An error has occurred
d   Job has been deleted by the user
t   Job is being transferred to a compute node
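The state column is easy to pick out with standard text tools. A minimal sketch, assuming the column layout of the listing above (job ID in column 1, state in column 5); list_states is a hypothetical helper, and the sample text stands in for live qstat output.

```shell
# Sample qstat output, copied from the listing above.
sample='job-id prior name user state submit/start at queue
73344 0.60500 mod_run danielyli dt 10/06/2011 14:38:45 all.q@compute-0-13.local
76105 0.50500 myscript.s murri r 10/06/2011 14:40:35 all.q@compute-0-20.local'

# list_states: read qstat-style lines on stdin and print "job-id state"
# for every line that starts with a numeric job ID (this skips the header).
list_states() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $5 }'
}

echo "$sample" | list_states
```

On the cluster, the same filter would be used as qstat | list_states.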
Job submission, III
Exercises:
1. Write a script rank1.sh to run the command ./rank-int.i386 M0,6-D5.sms, then run it. Does this job appear in qstat output? Compare the output with what you would get when running locally: is there any significant change?
2. Write a script rank2.sh to run the command ./rank-int.i386 M0,6-D11.sms, then run it. Does this job appear in qstat output? When do the standard output and standard error files appear? What is their initial content?
3. How can you determine the amount of resources (CPU time, wall-clock time, etc.) used by a job?
Job resource utilization, I
The qstat -j command reports information on a job while it is running. Example:
$ qstat -j 76106
==============================================================
job_number:       76106
exec_file:        job_scripts/76106
submission_time:  Thu Oct 6 14:51:45 2011
owner:            murri
[...]
cwd:              /home/murri/lsci
[...]
script_file:      myscript.sh
usage 1:          cpu=00:01:30, mem=8.64453 GBs, io=0.02295, vmem=103.637m, maxvme
scheduling info:  queue instance "all.q@compute-0-3.local" dropped because it is t
[...]
The usage line contains the current resource utilization.
Job resource utilization, II
The qacct command reports all information on a job, but only after it has completed. Example:
$ qacct -j 76106
==============================================================
qname        all.q
hostname     compute-0-27.local
group        murri
[...]
jobname      myscript.sh
jobnumber    76106
taskid       undefined
[...]
qsub_time    Thu Oct 6 14:51:45 2011
start_time   Thu Oct 6 14:51:50 2011
end_time     Thu Oct 6 14:54:29 2011
[...]
exit_status  0
ru_wallclock 159
ru_utime     158.421
ru_stime     0.456
[...]
cpu          158.877
mem          15.183
[...]
maxvmem      103.637M
[...]
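The accounting fields relate to each other in a simple way: in this report, cpu equals ru_utime plus ru_stime. The sketch below checks that on the figures shown above and derives how busy the job kept its CPU; qacct_field is a hypothetical helper, and the sample text stands in for real qacct output.

```shell
# Fields copied from the qacct report above.
sample='ru_wallclock 159
ru_utime 158.421
ru_stime 0.456
cpu 158.877'

# qacct_field NAME: print the value of field NAME from "name value" lines on stdin.
qacct_field() {
    awk -v name="$1" '$1 == name { print $2 }'
}

utime=$(echo "$sample" | qacct_field ru_utime)
stime=$(echo "$sample" | qacct_field ru_stime)
wall=$(echo "$sample" | qacct_field ru_wallclock)

# CPU time is user time plus system time; dividing it by the wall-clock
# time gives the share of the run during which the job kept a CPU busy.
cpu_total=$(awk -v u="$utime" -v s="$stime" 'BEGIN { printf "%.3f", u + s }')
busy=$(awk -v c="$cpu_total" -v w="$wall" 'BEGIN { printf "%.1f", 100 * c / w }')

echo "cpu=${cpu_total}s wallclock=${wall}s busy=${busy}%"
```

On the cluster, the input would come from qacct -j followed by the job ID instead of the sample text.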
Resource utilization, I
The -l option to qsub specifies the resources that a job will need. The most common resource requirements are:
s_rt      Total job runtime (wall-clock time), in seconds
s_cpu     Total job CPU time, in seconds
mem_free  Request at least this much free RAM; use m or g suffix for MB or GB
s_mem     Upper limit to RAM usage; use m or g suffix
s_vmem    Upper limit to virtual memory usage; use m or g suffix
Example:
    # run job with a time limit of 20 seconds
    $ qsub -l s_rt=20 myscript.sh
Resource utilization, II
Exercises:
1. Is the following job limited to 20 seconds runtime?
    $ qsub -l s_rt=20 rank2.sh
   What do you find in the job's stdout and stderr files? Compare with what happens in the ulimit case. What happens if you replace s_rt by s_cpu?
2. Run the same job, putting a 10MB limit on mem_free, then s_rss, s_mem, and finally s_vmem. Compare the actual resource utilization (via qacct) with the requirement. In what cases does the job terminate correctly? What is the resource utilization in these cases?
3. Compile a table with runtime, CPU time, and memory utilization for each of the matrices M*.sms. Is there a correlation with the matrix file size?
References
[1] setrlimit(2) manual page, http://manpages.ubuntu.com/manpages/oneiric/en/man2/getrlimit.2.html