Using Parallel Computing to Run Multiple Jobs

1 Beowulf Training: Using Parallel Computing to Run Multiple Jobs. Jeff Linderoth, August 5, 2003.

2 Outline
Introduction to Scheduling Software
The Wonderful World of PBS
The Equally Wonderful World of Condor
Lab Time
Why do we need scheduling software?

3 Resource Scheduling
So people don't fight over the resources! Schedulers...
Locate appropriate resources,
Manage resources, so multiple processes don't conflict over the same processor,
Ensure a fairness policy,
Are integrated with accounting software.
The schedulers on the Beowulf cluster are PBS and Condor.

4 Mmmmmmmmmmmmmm. Pie
Our first computational task will be to estimate π by numerical integration. Everyone knows
$\int_0^1 \frac{1}{1+x^2}\,dx = \arctan(x)\big|_{x=0}^{1} = \arctan(1) = \frac{\pi}{4}$.
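The rectangle rule on the next slide turns this integral into a finite sum. A sketch of the approximation with N rectangles of width 1/N (the exact midpoint placement of the sample points is an assumption; the slides only name "the rectangle rule"):

$\pi \approx \frac{4}{N}\sum_{i=0}^{N-1}\frac{1}{1+x_i^2}, \qquad x_i = \frac{i+\tfrac{1}{2}}{N}$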

5 The Rectangle Rule
[Figure: rectangle-rule approximation of the area under 4/(1+x*x) on [0,1]]

6 A Program to Estimate π
I've written a π-calculator for you.
cd
mkdir compute-pi
cd compute-pi
cp /tmp/training/session2/pi1.c .
gcc pi1.c -lm -o pi1
./pi1
This is not a parallel program. Just a simple (one-process) program. Nevertheless, we must submit it through a scheduling system to run it on the Beowulf cluster.
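The actual pi1.c is not reproduced in these notes. A minimal sketch of what such a rectangle-rule estimator might look like (the command-line argument for the number of rectangles and the midpoint placement are assumptions; compile exactly as above with gcc pi1.c -lm -o pi1):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Estimate pi with the rectangle rule applied to 4/(1+x*x) on [0,1]. */
int main(int argc, char **argv)
{
    long i, n;
    double h, x, sum, pi;

    n = (argc > 1) ? atol(argv[1]) : 100;   /* number of rectangles (assumed argument) */
    h = 1.0 / (double) n;                   /* width of each rectangle */
    sum = 0.0;
    for (i = 0; i < n; i++) {
        x = ((double) i + 0.5) * h;         /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);
    }
    pi = h * sum;
    printf("pi is about %.16f  Error is %e\n", pi, fabs(pi - M_PI));
    return 0;
}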

7 Running with PBS
A simple four-step process...
1. Create a PBS submission script
2. Submit the script to the PBS system using the command qsub
3. PBS runs the script on the first available resources
4. PBS collects output for the user's inspection

8 The PBS Submission Script Overview
(1) You make a request for resources,
(2) PBS will allocate a node pool to fulfill your request,
(3) Now you have to tell the node pool what to do!
Both steps (1) and (3) are accomplished through the PBS submission script. The script contains
PBS request statements,
Shell commands that will run your job on the allocated resources.
The shell commands are executed on the first node in your allocated nodes.

9 Our First PBS Submission Script
#PBS -q small
#PBS -l nodes=1:public
#PBS -l cput=00:05:00
#PBS -V
echo "The PBS job ID is: ${PBS_JOBID}"
echo "The PBS Node File is"
cat $PBS_NODEFILE
$HOME/compute-pi/pi1 100
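Since $PBS_NODEFILE is just a text file with one line per allocated node, the script body can also use it programmatically. A small illustrative addition you could append to the script above (not part of the course script):

NP=$(wc -l < $PBS_NODEFILE)    # number of nodes allocated to this job
echo "PBS allocated ${NP} node(s) to job ${PBS_JOBID}"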

10 Format of the PBS Submission Script
Lines that begin with #PBS are PBS directives.
Everything else is a shell command.
Shell commands are just things that you would type at the regular login prompt. But you can also do fancy looping and conditions (see chapter/bashref toc.html); a small sketch follows below.
After the PBS directives, you put any commands you would like. The command to run your program is usually a good one to include. :-)
Again, this is executed on the first node.
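As an illustration of such looping, a hedged sketch of a script that runs pi1 for several problem sizes within a single job (the particular sizes are made-up examples, not from the course materials):

#PBS -q small
#PBS -l nodes=1:public
#PBS -l cput=00:05:00
#PBS -V
# Loop over a few illustrative problem sizes inside one job
for n in 10 100 1000
do
    echo "Running pi1 with n = $n"
    $HOME/compute-pi/pi1 $n
done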

11 Breaking It Down. PBS Directives
-q Specifies the queue in which to place the job. We have two queues, small and large:
  small: Max CPU time 20 minutes/process.
  large: Lower priority than jobs in the small queue.
-l Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed.
-V Declares that all environment variables in the qsub command's environment are to be exported to the batch job. If you would like the PBS job to inherit the same environment as the one you are currently running in (same PATH variable, etc.), you should include this directive.

12 The -l Story
For resources, you will typically only need to declare
the number of nodes and which class of nodes you request:
#PBS -l nodes=4:public
the maximum CPU time:
#PBS -l cput=00:15:00
For the truly brave and curious, the command is man pbs_resources.
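If you prefer, these can usually be combined into a single -l line, since qsub accepts a comma-separated resource list (this particular combination is just an illustration, not from the slides):

#PBS -l nodes=4:public,cput=00:15:00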

13 PBS: The Big Three
qsub  Submit a PBS job
qstat Check the status of a PBS job
qdel  Delete a PBS job
man <command> will give you lots more information.
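For concreteness, a couple of usage examples (using the job ID 5972.fire1 that appears on the next slide):

qstat 5972.fire1    # status of one particular job
qdel 5972.fire1     # remove that job from the queue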

14 Let's do it!
compute-pi-1]$ qsub run.pbs
5972.fire1
compute-pi-1]$ qstat -a
fire1:
                                                     Req'd  Req'd    Elap
Job ID       Username Queue  Jobname  SessID NDS TSK Memory Time   S Time
5972.fire1   jtl3     small  run.pbs    --    1  --    --   00:20  E   --
Note that the job ID is printed for you when you submit the job.
qstat -a: Shows the status of all jobs.

15 Looking at the Output
By default standard output goes to <scriptname>.o<job number>.
By default standard error goes to <scriptname>.e<job number>.
compute-pi-1]$ cat run.pbs.o5972
The PBS job ID is: 5972.fire1
The PBS Node File is
fire34
pi is about ...  Error is ...e-02
Note how the PBS environment variables are interpreted in the script.

16 Other Cool PBS Stuff You May Want To Do
#PBS -N <Name> : Name your job
#PBS -o <File.out> : Redirect standard output to File.out
#PBS -e <File.err> : Redirect standard error to File.err
#PBS -m, -M : Mail options
Job dependencies (a sketch of the mail and dependency options follows below)
For a list of all PBS command file options... man qsub
Any PBS questions?
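A hedged sketch of what the mail and dependency options can look like in practice (the mail events abe = abort/begin/end and the afterok dependency form are standard qsub options; the job ID and address here are just placeholders taken from elsewhere in these notes):

#PBS -N compute-pi
#PBS -m abe                          # mail on abort, begin, and end
#PBS -M jtl3@lehigh.edu              # where the mail goes
#PBS -W depend=afterok:5972.fire1    # start only after that job finishes successfully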

17 Condor
For purposes of this discussion, think of Condor as a different scheduler. Condor is a bit more fancy:
Often used for non-dedicated resources (will run only when no one else would use the machine),
Checkpointing/migration,
Remote I/O.
Likely, the accounting charge will be less for jobs submitted to the Condor scheduler.

18 Checkpointing/Migration
[Figure: timeline involving a professor's machine, a grad student's machine, and a checkpoint server. The job runs on the professor's machine from 5am until the professor arrives at 8am, checkpoints (about 5 min) to the checkpoint server, and migrates to the grad student's machine at 8:10am; it checkpoints again (about 5 min) when the grad student arrives and resumes when the grad student leaves at 12pm.]

19 Condor Universes
Condor jobs are submitted to a specific Condor universe:
Standard: Has cool features like checkpointing and migration of jobs. Requires special linking of your program.
Vanilla: No cool Condor features (regular).
MPI/PVM: Not mentioned here today, but they exist.

20 Compiling for Condor
Standard universe: put the command condor_compile in front of your normal link line.
condor]$ condor_compile gcc pi1.c -o pi1-standard -lm
Vanilla universe: do nothing.
Now Condor submission is like PBS submission:
A different command (job description) file,
Different submission/monitoring commands.

21 A Sample Condor Submission File
universe = standard
executable = pi1-standard
arguments =
output = pi1.out
error = pi1.err
notification = Complete
notify_user = jtl3@lehigh.edu
getenv = True
rank = kflops
queue
man condor_submit

22 The Big Four
condor_submit <job.condor>  Submit a job to the Condor scheduler
condor_q                    Check the status of the queue of Condor jobs
condor_status               Check the status of the Condor pool
condor_rm <jobid>           Delete a Condor job
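A few illustrative invocations (the cluster number 16 comes from the next slide; condor_status with no arguments simply lists the pool):

condor]$ condor_status     # show the machines in the Condor pool and their states
condor]$ condor_q 16       # show only the jobs in cluster 16
condor]$ condor_rm 16.0    # remove job 16.0 from the queue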

23 Let's Do It!
condor]$ condor_submit run.condor
Submitting job(s).
1 job(s) submitted to cluster 16.
[jtl3@fire1 condor]$ condor_q
-- Submitter: fire1.cluster : <...:32777> : fire1.cluster
 ID    OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
16.0   jtl3   8/4 11:...   ...:00:16  R  ... ...  pi1-standard
[jtl3@fire1 condor]$ cat pi1.out
pi is about ...  Error is ...e-09
I could do condor_rm 16.0
Any Condor questions?

24 Quit Wasting My Time!
OK, Linderoth, I thought today was supposed to be about parallel computing!
That will be the focus of the next section(s).
For now, let's do some simple parallel computing. Suppose I'd like to run the same executable pi1, but with many different input files or parameters. Use the multiple processors to get your work done faster.

25 Running Many Jobs
We need a way to easily submit many different jobs. We will use the shell's scripting capabilities.
PBS: Use a template command file and the sed utility.
Condor: Use the -a flag to condor_submit.

26 PBS Run Multiple Jobs. Step #1
Create a template submission file.
#!/bin/bash
#PBS -q small
#PBS -l nodes=1:public
#PBS -l walltime=00:05:00
#PBS -V
echo "The PBS job ID is: ${PBS_JOBID}"
echo "The PBS Node File is"
cat $PBS_NODEFILE
/home/jtl3/class/pbs/pi1 XXX_N_XXX

27 PBS Run Multiple Jobs. Step #2
Create a shell script to do the multiple submission.
#!/bin/bash
for n in 100 1000 10000 100000 1000000   # example problem sizes
do
    sed s/XXX_N_XXX/$n/g run.pbs.template > run.pbs.tmp
    qsub run.pbs.tmp
    rm run.pbs.tmp
done
The sed command replaces all occurrences of the pattern XXX_N_XXX in run.pbs.template with the value of $n.
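To see what one iteration does, you can run the substitution by hand on the template from Step #1 (the value 100 is arbitrary):

pbs]$ sed s/XXX_N_XXX/100/g run.pbs.template | tail -1
/home/jtl3/class/pbs/pi1 100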

28 PBS Run Multiple Jobs.
pbs]$ sh run-many.sh
5989.fire1
...
sh runs the script you created; qsub prints one job ID like this for each submitted job.
Any questions about PBS multiple job submission?

29 Condor Run Multiple Jobs Example
condor_submit allows the user to override statements in the submission file. Use the -a flag.
This makes our scripting life easier: we don't need to use sed.

30 Condor Run Multiple Jobs. Step #1
Create the Condor submission file. Note: no arguments or output lines!
executable = pi1-standard
universe = standard
notification = Complete
notify_user = jtl3@lehigh.edu
getenv = True
rank = kflops
queue

31 The Condor Multiple Job Submission Script
Create the Condor multiple job submission script. Note the use of the -a option!
#!/bin/bash
for n in 100 1000 10000 100000 1000000   # example problem sizes
do
    condor_submit -a "arguments = $n" -a "output = pi.$n.out" \
        run.condor.many
done
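Each pass through the loop is equivalent to typing a single command such as (the value 100 is just an example):

condor]$ condor_submit -a "arguments = 100" -a "output = pi.100.out" run.condor.many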

32 Multiple Condor Submission Example
condor]$ sh run-many.sh
Submitting job(s).
1 job(s) submitted to cluster 32.
Submitting job(s).
1 job(s) submitted to cluster 33.
Submitting job(s).
1 job(s) submitted to cluster 34.
Submitting job(s).
1 job(s) submitted to cluster 35.
Submitting job(s).
1 job(s) submitted to cluster 36.
[jtl3@fire1 condor]$ condor_q
-- Submitter: fire1.cluster : <...:32777> : fire1.cluster
 ID    OWNER SUBMITTED    RUN_TIME   ST PRI SIZE CMD
33.0   jtl3  8/4 12:...   ...:00:01  R  ... ...  pi1-standard
34.0   jtl3  8/4 12:...   ...:00:00  R  ... ...  pi1-standard
35.0   jtl3  8/4 12:...   ...:00:00  I  ... ...  pi1-standard
36.0   jtl3  8/4 12:...   ...:00:00  I  ... ...  pi1-standard
4 jobs; 2 idle, 2 running, 0 held

33 The End!
Schedulers are required for use in a parallel computing environment.
PBS and Condor are cool.
You can do parallel computing even without MPI.
The Beowulf cluster can be a CPU cycle server for your research!
