Beowulf Training Using Parallel Computing to Run Multiple Jobs Jeff Linderoth August 5, 2003 August 5, 2003 Beowulf Training Running Multiple Jobs Slide 1
Outline
- Introduction to Scheduling Software
- The Wonderful World of PBS
- The Equally Wonderful World of Condor
- Lab Time
Why do we need scheduling software?

Resource Scheduling
So people don't fight over the resources!
Schedulers...
- Locate appropriate resources,
- Manage resources, so multiple processes don't conflict over the same processor,
- Ensure a fairness policy,
- Are integrated with accounting software.
The schedulers on the Beowulf cluster are PBS and Condor.

Mmmmmmmmmmmmmm. Pie
Our first computational task will be to estimate π by numerical integration.
Everyone knows...
$$\int_0^1 \frac{1}{1+x^2}\,dx = \arctan(x)\Big|_{x=0}^{1} = \arctan(1) = \frac{\pi}{4}.$$

The Rectangle Rule
[Figure: plot of 4/(1+x*x) for x from 0 to 1; the vertical axis runs from 0 to 4.]

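The rectangle rule can be sketched in a few lines of shell. This is a minimal sketch and not the author's pi1.c: it assumes the midpoint variant of the rule (each rectangle's height is taken at the middle of its subinterval), and the function name estimate_pi is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not the author's pi1.c): estimate pi by summing
# n rectangles under 4/(1+x*x) on [0,1], with heights taken at midpoints.
estimate_pi() {
    awk -v n="$1" 'BEGIN {
        h = 1.0 / n                      # width of each rectangle
        sum = 0.0
        for (i = 0; i < n; i++) {
            x = (i + 0.5) * h            # midpoint of the i-th subinterval
            sum += 4.0 / (1.0 + x * x)   # height of the rectangle
        }
        printf "%.10f\n", sum * h        # total area approximates pi
    }'
}

estimate_pi 1000
```

More rectangles give a better estimate, which is why the later slides run the program with larger and larger arguments.
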
A Program to Estimate π
I've written a π-calculator for you.
    cd
    mkdir compute-pi
    cd compute-pi
    cp /tmp/training/session2/pi1.c .
    gcc pi1.c -lm -o pi1
    ./pi1 1000
This is not a parallel program, just a simple (one-process) program.
Nevertheless, we must submit it through a scheduling system to run it on the Beowulf cluster.

Running with PBS
A simple four-step process...
1. Create a PBS submission script
2. Submit the script to the PBS system using the command qsub
3. PBS runs the script on the first available resources
4. PBS collects output for the user's inspection

The PBS Submission Script: Overview
(1) You make a request for resources.
(2) PBS will allocate a node pool to fulfill your request.
(3) Now you have to tell the node pool what to do!
Both steps (1) and (3) are accomplished through the PBS submission script.
The script contains
- PBS request statements
- Shell commands that will run your job on the allocated resources.
The shell commands are executed on the first node in your allocated nodes.

Our First PBS Submission Script
    #PBS -q small
    #PBS -l nodes=1:public
    #PBS -l cput=00:05:00
    #PBS -V
    echo "The PBS job ID is: ${PBS_JOBID}"
    echo "The PBS Node File is"
    cat $PBS_NODEFILE
    $HOME/compute-pi/pi1 100

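The script prints the node file with cat, but the same file can feed later commands. The sketch below counts the distinct hosts in it. The helper name count_nodes is hypothetical, and the code assumes the node file lists one host name per line.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct hosts listed in a PBS node file.
# Assumes the file lists one host name per line, as printed by
# "cat $PBS_NODEFILE".
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a PBS job, PBS_NODEFILE is set by PBS; outside one, do nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Allocated nodes: $(count_nodes "$PBS_NODEFILE")"
fi
```
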
Format of the PBS Submission Script
- Lines that begin with #PBS are PBS directives.
- Everything else is a shell command.
- Shell commands are just things that you would type at the regular login prompt.
- But you can also do fancy looping and conditions.
  http://www.gnu.org/manual/bash/html_chapter/bashref_toc.html
After the PBS directives, you put any commands you would like.
The command to run your program is usually a good one to include. :-)
Again, this is executed on the first node.

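As one illustration of looping and conditions in a submission script, the sketch below runs a program once per problem size, and only if the program is executable. The helper name run_sizes is hypothetical; the path in the usage comment is the one from the earlier slide.

```shell
#!/bin/bash
# Hypothetical helper: run a program once per problem size, but only if
# the program exists and is executable.
run_sizes() {
    local prog=$1
    shift
    if [ ! -x "$prog" ]; then
        echo "Cannot execute $prog" >&2
        return 1
    fi
    local n
    for n in "$@"; do
        echo "Running $prog $n"
        "$prog" "$n"
    done
}

# Example use inside a PBS script (path taken from the earlier slide):
#   run_sizes "$HOME/compute-pi/pi1" 100 1000 10000
```
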
Breaking It Down: PBS Directives
-q  Specifies the queue in which to place the job. We have two queues, small and large.
      small: Max CPU time 20 minutes/process.
      large: Lower priority than jobs in the small queue.
-l  Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed.
-V  Declares that all environment variables in the qsub command's environment are to be exported to the batch job.
    If you would like the PBS job to inherit the same environment as the one you are currently running in (same PATH variable, etc.), you should include this directive.

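Before choosing a queue, it can help to check a CPU-time request against the 20-minute limit quoted for the small queue. The following is a sketch only: the helper names are hypothetical, and it assumes the request is written as hh:mm:ss, as in the cput examples.

```shell
#!/bin/bash
# Hypothetical helpers: convert an hh:mm:ss CPU-time request to seconds
# and compare it with the 20-minute limit quoted for the small queue.
cput_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

fits_small_queue() {
    [ "$(cput_to_seconds "$1")" -le $(( 20 * 60 )) ]
}

# Example: a 5-minute request fits in the small queue.
if fits_small_queue 00:05:00; then
    echo "00:05:00 fits in the small queue"
fi
```
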
The -l Story
For resources, you will typically only need to declare
- the number of nodes and which class of nodes you request:
    #PBS -l nodes=4:public
- the maximum CPU time:
    #PBS -l cput=00:15:00
For the truly brave and curious, the command is
    man pbs_resources

PBS: The Big Three
qsub   Submit a PBS job
qstat  Check the status of a PBS job
qdel   Delete a PBS job
man <command> will give you lots more information.

Let's do it!
    [jtl3@fire1 compute-pi-1]$ qsub run.pbs
    5972.fire1
    [jtl3@fire1 compute-pi-1]$ qstat -a

    fire1:
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    5972.fire1      jtl3     small    run.pbs     27018   1  --     -- 00:20 E    --
- Note that the job ID is printed for you when you submit the job.
- qstat -a : Shows the status of all jobs

Looking at the Output
- By default, standard output goes to <scriptname>.o<job number>
- By default, standard error goes to <scriptname>.e<job number>
    [jtl3@fire1 compute-pi-1]$ cat run.pbs.o5972
    The PBS job ID is: 5972.fire1
    The PBS Node File is
    fire34
    pi is about 3.1614997369512658487167300
    Error is 1.9907083361472733e-02
Note how the PBS environment variables are interpreted in the script.

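The default file names can be built from the script name and the job ID that qsub prints. A small sketch, with hypothetical helper names, assuming the job ID has the form <job number>.<host> as in 5972.fire1:

```shell
#!/bin/bash
# Hypothetical helpers: build the default output and error file names
# from a script name and the job ID printed by qsub (e.g. 5972.fire1).
pbs_stdout_file() {
    local script=$1 jobid=$2
    echo "${script}.o${jobid%%.*}"   # strip ".fire1", keep the job number
}

pbs_stderr_file() {
    local script=$1 jobid=$2
    echo "${script}.e${jobid%%.*}"
}

pbs_stdout_file run.pbs 5972.fire1   # prints run.pbs.o5972
```
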
Other Cool PBS Stuff You May Want To Do
- #PBS -N <Name> : Name your job
- #PBS -o <File.out> : Redirect standard output to File.out
- #PBS -e <File.err> : Redirect standard error to File.err
- #PBS -m -M : Mail options
- Job dependencies
For a list of all PBS command file options...
    man qsub
Any PBS Questions?

Condor
For purposes of this discussion, think of Condor as a different scheduler.
Condor is a bit fancier.
- Often used for nondedicated resources. (Jobs run only when no one else is using the machine.)
- Checkpointing/Migration
- Remote I/O
Likely, the accounting charge will be less for jobs submitted to the Condor scheduler.
http://www.cs.wisc.edu/condor
http://www.lehigh.edu/~inlts/comp/linux/condor/

Checkpointing/Migration
[Figure: timeline of a job moving between the Professor's Machine and the Grad Student's Machine via the Checkpoint Server. Labels: 5am, Professor Arrives, 8am, 8:10am, Grad Student Arrives, Grad Student Leaves, 12pm; two intervals are marked 5 min.]

Condor Universes
Condor jobs are submitted to a specific Condor Universe.
- Standard
    Has cool features like checkpointing and migration of jobs
    Requires special linking of your program
- Vanilla
    No cool Condor features (regular)
- MPI/PVM
    Not covered here today, but they exist.

Compiling for Condor
Standard Universe
    Put the command condor_compile in front of your normal link line.
    [jtl3@fire1 condor]$ condor_compile gcc pi1.c -o pi1-standard -lm
Vanilla Universe
    Do nothing
Now Condor submission is like PBS submission:
- Different command (job description) file
- Different submission/monitoring commands

A Sample Condor Submission File
    universe = standard
    executable = pi1-standard
    arguments = 1000000000
    output = pi1.out
    error = pi1.err
    notification = Complete
    notify_user = jtl3@lehigh.edu
    getenv = True
    rank = kflops
    queue
For more information: man condor_submit

The Big Four
condor_submit <job.condor>  Submit a job to the Condor scheduler
condor_q                    Check the status of the queue of Condor jobs
condor_status               Check the status of the Condor pool
condor_rm <jobid>           Delete a Condor job

Let's Do It!
    [jtl3@fire1 condor]$ condor_submit run.condor
    Submitting job(s).
    1 job(s) submitted to cluster 16.
    [jtl3@fire1 condor]$ condor_q

    -- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
     ID      OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     16.0    jtl3    8/4 11:22  0+00:00:16   R  0   3.4  pi1-standard 1000000000
    [jtl3@fire1 condor]$ cat pi1.out
    pi is about 3.1415926555921398488635532
    Error is 2.0023467328655897e-09
To delete the job, I could run condor_rm 16.0
Any Condor questions?

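A script that wants to track or remove the job later needs the cluster number from the condor_submit output. The sketch below extracts it. The helper name cluster_id is hypothetical, and the code assumes the message format shown on the slide.

```shell
#!/bin/bash
# Hypothetical helper: pull the cluster number out of condor_submit's
# output, assuming the message format shown on the slide:
#   "1 job(s) submitted to cluster 16."
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Demonstration on the text from the slide (no job is submitted here).
printf 'Submitting job(s).\n1 job(s) submitted to cluster 16.\n' | cluster_id
```

On the cluster, the helper would sit at the end of a pipe, for example `condor_submit run.condor | cluster_id`.
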
Quit Wasting My Time!
OK, Linderoth, I thought today was supposed to be about parallel computing!
- That will be the focus of the next section(s).
- For now, let's do some simple parallel computing.
Suppose I'd like to run the same executable pi1, but with many different input files or parameters.
Use the multiple processors to get your work done faster.

Running Many Jobs
We need a way to easily submit many different jobs.
We will use the shell's scripting capabilities.
- PBS: Use a template command file and the sed utility
- Condor: Use the -a flag to condor_submit

PBS: Run Multiple Jobs, Step #1
Create a template submission file.
    #!/bin/bash
    #PBS -q small
    #PBS -l nodes=1:public
    #PBS -l walltime=00:05:00
    #PBS -V
    echo "The PBS job ID is: ${PBS_JOBID}"
    echo "The PBS Node File is"
    cat $PBS_NODEFILE
    /home/jtl3/class/pbs/pi1 XXX_N_XXX

PBS: Run Multiple Jobs, Step #2
Create a shell script to do the multiple submission.
    #!/bin/bash
    for n in 100 1000 10000 100000 1000000
    do
        # Fill in the placeholder, submit the result, then clean up
        sed "s/XXX_N_XXX/$n/g" run.pbs.template > run.pbs.tmp
        qsub run.pbs.tmp
        rm run.pbs.tmp
    done
The sed command replaces all occurrences of the pattern XXX_N_XXX in run.pbs.template with the value of $n and writes the result to run.pbs.tmp.

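To check the substitution before submitting anything, the sed step can be tried on its own. The sketch below is a dry run that only prints the filled-in text. The helper name render_template is hypothetical, and the placeholder is assumed to be spelled XXX_N_XXX, matching the case used in the template.

```shell
#!/bin/bash
# Hypothetical helper: fill in the XXX_N_XXX placeholder of a template
# and write the result to standard output (nothing is submitted here).
render_template() {
    local template=$1 n=$2
    sed "s/XXX_N_XXX/$n/g" "$template"
}

# Dry run on a throwaway one-line template.
demo=$(mktemp)
echo '/home/jtl3/class/pbs/pi1 XXX_N_XXX' > "$demo"
for n in 100 1000; do
    render_template "$demo" "$n"
done
rm -f "$demo"
```

Note that sed patterns are case-sensitive, so the pattern in the script and the placeholder in the template must match exactly.
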
PBS: Run Multiple Jobs
Run the script you created with sh:
    [jtl3@fire1 pbs]$ sh run-many.sh
    5989.fire1
    5990.fire1
    5991.fire1
    5992.fire1
    5993.fire1
Any questions about PBS multiple job submission?

Condor: Run Multiple Jobs Example
- condor_submit allows the user to override statements in the submission file.
- Use the -a flag.
- This makes our scripting life easier: we don't need to use sed.

Condor: Run Multiple Jobs, Step #1
Create the Condor submission file.
Note: no arguments or output lines!
    executable = pi1-standard
    universe = standard
    notification = Complete
    notify_user = jtl3@lehigh.edu
    getenv = True
    rank = kflops
    queue

The Condor Multiple Job Submission Script
Create the Condor multiple job submission script.
Note the use of the -a option!
    #!/bin/bash
    for n in 100 1000 10000 100000 1000000
    do
        # Each -a appends one statement to the submission file
        condor_submit -a "arguments = $n" -a "output = pi.$n.out" \
            run.condor.many
    done

Multiple Condor Submission Example
    [jtl3@fire1 condor]$ sh run-many.sh
    Submitting job(s).
    1 job(s) submitted to cluster 32.
    Submitting job(s).
    1 job(s) submitted to cluster 33.
    Submitting job(s).
    1 job(s) submitted to cluster 34.
    Submitting job(s).
    1 job(s) submitted to cluster 35.
    Submitting job(s).
    1 job(s) submitted to cluster 36.
    [jtl3@fire1 condor]$ condor_q

    -- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
     ID      OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     33.0    jtl3    8/4 12:16  0+00:00:01   R  0   3.4  pi1-standard 1000
     34.0    jtl3    8/4 12:16  0+00:00:00   R  0   3.4  pi1-standard 10000
     35.0    jtl3    8/4 12:16  0+00:00:00   I  0   3.4  pi1-standard 10000
     36.0    jtl3    8/4 12:16  0+00:00:00   I  0   3.4  pi1-standard 10000

    4 jobs; 2 idle, 2 running, 0 held

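The summary line at the bottom of the condor_q listing can be reproduced from the job lines themselves. This is a sketch with a hypothetical helper name, count_state. It assumes the column layout shown on the slide, where the state (ST) is the sixth whitespace-separated field.

```shell
#!/bin/bash
# Hypothetical helper: count jobs in a given state (R = running, I = idle)
# from condor_q job lines read on standard input. Assumes the layout from
# the slide: job ID first, state (ST) as the sixth field.
count_state() {
    awk -v st="$1" '
        $1 ~ /^[0-9]+\.[0-9]+$/ && $6 == st { c++ }   # only job lines
        END { print c + 0 }                           # print 0 if none
    '
}

# Example use on the cluster:
#   condor_q | count_state R
```
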
The End!
- Schedulers are required for use in a parallel computing environment.
- PBS and Condor are cool.
- You can do parallel computing even without MPI.
- The Beowulf cluster can be a CPU cycle server for your research!