Resource Management and Job Scheduling


1 Resource Management and Job Scheduling
Jenett Tillotson, Senior Cluster System Administrator, Indiana University

2 Resource Managers
Keep track of resources
  Nodes: CPUs, disk, memory, swap, load, etc.
  Network, licenses, storage, etc.
Keep track of requests: jobs, queues, etc.
Control jobs which use these resources: stop, hold, cancel, monitor, etc.

3 Job Scheduler
Decides what jobs run on what resources
Pretty complicated:
  Quality of Service / Service Level Agreements
  Avoid job starvation
  Job placement
  Maximize good stuff, minimize bad stuff

4 TORQUE: Terascale Open-source Resource and QUEue Manager
Portable Batch System (PBS), NASA, 1991
OpenPBS, open source, 1998
PBSPro, commercial product
TORQUE, open source, 2003
Hosted and developed by Adaptive Computing

5 Moab
Maui, mid 1990s, open sourced 2000
Moab, commercial product, 2001
Dave Jackson, creator of Maui/Moab, started Cluster Resources, now Adaptive Computing

6 Torque Topology Diagram

7 Master Node
pbs_server provides:
  Node tracking
  Queues and queuing policies
  Storage for job scripts and tracking of jobs
  Usage and event logs
pbs_sched: FIFO scheduler

8 Compute Nodes
pbs_mom: Machine Oriented Mini-server
  Starts the job on the compute resources
  Monitors resource utilization
  Notifies pbs_server of job events
  Facilitates multi-node jobs
  Spools stdout and stderr
Mother Superior and sister MOMs
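If a compute node is misbehaving, the MOM on it can be queried directly from the master node with TORQUE's momctl utility; a minimal sketch (node1 is a placeholder hostname, and -d sets the verbosity level):
  momctl -d 3 -h node1    # MOM diagnostics: server connectivity, active jobs, configured directories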

9 Submit Nodes
TORQUE client commands: qsub, qdel, qhold/qrls, qstat, qalter
All nodes
  trqauthd: TORQUE Authorization Daemon, runs on all nodes

10-14 Job Flow (diagram sequence)

15 Installation
Requires: libxml2-devel, openssl-devel, Tcl/Tk for the (optional) GUI, libhwloc for (optional) cpusets, gcc, gcc-c++, make, libtool, boost-devel
configure; make; make install
  make install_mom, make install_client, make install_server
  make rpm  -or-  make packages
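A minimal sketch of that build sequence, assuming the source tarball has already been unpacked (the directory name is illustrative):
  cd torque-4.2.x                 # unpacked source tree
  ./configure
  make
  sudo make install               # full install on this host
  # or install only the pieces a given node needs:
  sudo make install_server
  sudo make install_mom
  sudo make install_client
  # or build distributable packages for the rest of the cluster:
  make rpm
  make packages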

16 Configuring TORQUE
./configure options:
  --prefix=/usr/local/
  --with-server-home=/var/spool/torque/
  --with-default-server=$hostname
Key files:
  pbs_server: /var/spool/torque/server_priv/nodes
  pbs_mom: /var/spool/torque/mom_priv/config
  /var/spool/torque/server_name
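Putting those options together, a hedged example of configuring the server host (check ./configure --help on your TORQUE version for the exact option spellings):
  ./configure --prefix=/usr/local \
              --with-server-home=/var/spool/torque \
              --with-default-server=$(hostname)
  make && sudo make install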

17 /var/spool/torque/server_priv/nodes:
node1 np=16 prop1 prop2
node2 np=16 prop1
node3 np=32 prop3 prop2
node4 np=16 prop1 prop2

18 /var/spool/torque/mom_priv/config:
$loglevel 3
$spool_as_final_name true
$usecp *:/N/home /N/home
$usecp *:/N/dc2 /N/dc2

19 /var/spool/torque/server_name:
myresmgr.domain.edu

20 Running TORQUE
Startup the first time: pbs_server -t create
Then start pbs_mom and trqauthd
Startup scripts are in $BUILD_DIR/contrib/
Testing: pbsnodes, qmgr
Logs: /var/spool/torque/server_logs, /var/spool/torque/mom_logs
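A sketch of that first startup and a quick smoke test (run the pbs_mom line on each compute node; the qmgr call here is read-only):
  pbs_server -t create     # first start only: creates a fresh server database
  trqauthd                 # authorization daemon, wherever TORQUE commands are run
  pbs_mom                  # on every compute node
  pbsnodes -a              # every node should eventually report state = free
  qmgr -c 'print server'   # dump the current server configuration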

21 Security
Compute nodes and submit hosts must be able to reach the listening port on the pbs_server
pbs_server must be able to reach the port on the compute nodes
The compute nodes must be able to reach the port on the other compute nodes

22 TORQUE Configuration - qmgr
create queue foo
set queue foo queue_type = Execution
set queue foo resources_max.nodes = 32
set queue foo resources_max.walltime = 24:00:00
set queue foo resources_default.nodes = 1
set queue foo resources_default.walltime = 1:00:00
set queue foo enabled = True
set queue foo started = True
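These statements are fed to qmgr either interactively or one at a time with -c; a sketch (queue-foo.qmgr is a hypothetical file holding the lines above):
  qmgr -c "create queue foo"
  qmgr -c "set queue foo queue_type = Execution"
  qmgr < queue-foo.qmgr          # or load a whole file of settings at once
  qmgr -c "list queue foo"       # verify the result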

23 TORQUE Configuration (cont.)
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = myresmgr
set server managers = root@myresmgr
set server operators = root@myresmgr
set server submit_hosts = mysubmithost

24 TORQUE Configuration (cont.)
set server default_queue = foo
set server log_events = 511
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 6

25 TORQUE Commands
qstat: queries the resource manager. Common usage:
  qstat -f $JOBID : displays full info for $JOBID
  qstat -a : displays all jobs
  qstat -q : displays queue status
  qstat -Qf : displays queue definitions

26 TORQUE Commands (cont.)
pbsnodes: queries the state of nodes and marks a node offline or online
  pbsnodes -o $NODE : sets $NODE offline
  pbsnodes -r $NODE : clears the offline state
  pbsnodes -l : lists all nodes that are down or offline
  pbsnodes -l $STATE : lists all nodes in state $STATE

27 Job Script
#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -l walltime=2:00:00
#PBS -N myjobname
#PBS -m bea
#PBS -M
#PBS -j oe
#PBS -k o
#PBS -V
#PBS -q foo
cd $PBS_O_WORKDIR
./runmyjob

28 TORQUE Directives
-l : resource requests
-N : job name
-m : when to mail (b: start, e: end, a: abort, n: none)
-M : where to mail
-j : join output streams
-k : keep output stream
-V : copy submission environment to compute node
-q : queue to submit to

29 Job Environment Variables
PBS_O_HOST - The machine that submitted the job
PBS_O_LOGNAME - The user who submitted the job
PBS_O_HOME - The home directory of the user who submitted the job
PBS_O_WORKDIR - The working directory from where the qsub was run
PBS_ENVIRONMENT - Set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs
PBS_O_QUEUE - The original queue to which the job was submitted
PBS_JOBID - The identifier that PBS assigns to the job
PBS_JOBNAME - The name of the job
PBS_NODEFILE - The file which contains the list of nodes assigned to the job
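A throwaway job script that echoes several of these variables is an easy way to confirm they are set as expected; a minimal sketch:
  #!/bin/bash
  #PBS -l nodes=1:ppn=1
  #PBS -l walltime=0:05:00
  #PBS -N envcheck
  echo "Submitted from: $PBS_O_HOST:$PBS_O_WORKDIR by $PBS_O_LOGNAME"
  echo "Job: $PBS_JOBID ($PBS_JOBNAME), queue: $PBS_O_QUEUE, mode: $PBS_ENVIRONMENT"
  echo "Assigned nodes:"
  cat "$PBS_NODEFILE"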

30 Job Control
qsub : submit a job to the queues
qdel : delete a job from the queues
qhold : put a job on hold
qrls : release a hold
qstat : job status
qalter : alter the attributes of an idle job
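A typical control sequence, assuming a job script named myjob.pbs (the name is illustrative); qsub prints the new job identifier on stdout:
  JOBID=$(qsub myjob.pbs)
  qstat -f "$JOBID"                      # full status
  qhold "$JOBID"                         # place a hold
  qalter -l walltime=4:00:00 "$JOBID"    # only allowed while the job is still idle/held
  qrls "$JOBID"                          # release the hold
  qdel "$JOBID"                          # remove the job from the queue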

31 Submitting a Job
qsub $JOB_SCRIPT_FILE
qsub -l nodes=1:ppn=16 -l walltime=2:00:00 -q foo -N myname $JOB_SCRIPT_FILE
qsub -I : submits an interactive job
Directives on the command line will override the directives in the job script
Jobs are spooled in /var/spool/torque/server_priv/jobs

32 Job Scheduling
pbs_sched : simple FIFO scheduler
qrun : run a queued job by hand
Terminating TORQUE
  qterm -t quick : leaves jobs running
  qterm -t immediate : terminates all jobs as well
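With only pbs_sched (or no scheduler at all) running, a queued job can be pushed onto nodes manually; a sketch with a hypothetical job identifier:
  qrun 1234.myresmgr.domain.edu             # start the job on server-chosen nodes
  qrun -H node1 1234.myresmgr.domain.edu    # or name the execution host explicitly
  qterm -t quick                            # shut the server down, leaving running jobs alone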

33 Troubleshooting
tracejob -n $NUMB_OF_DAYS $JOB_ID
Logs:
  /var/spool/torque/server_logs
  /var/spool/torque/mom_logs
  /var/spool/torque/client_logs
  /var/spool/torque/server_priv/accounting
  /var/spool/torque/job_logs

34 Moab Workload Manager

35 Installation
Download from Adaptive Computing
Requires: libcurl, perl, perl-cpan, libxml2-devel, torque
configure; make; make install
Configure options:
  --prefix=/opt/moab
  --with-homedir=/opt/moab
  --with-serverhost=$hostname
  --with-torque=/usr/local
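A hedged sketch of the Moab build, mirroring the options above (the directory name is illustrative; verify option spellings against ./configure --help for your Moab release):
  cd moab-<version>                  # unpacked tarball from Adaptive Computing
  ./configure --prefix=/opt/moab \
              --with-homedir=/opt/moab \
              --with-serverhost=$(hostname) \
              --with-torque=/usr/local
  make
  sudo make install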

36 moab.cfg
SCHEDCFG[mysched] SERVER=mysched:42559
ADMINCFG[1] USERS=root
ADMINCFG[3] USERS=all
RMCFG[myresmgr] TYPE=PBS
RMCFG[myresmgr] SUBMITCMD=/usr/local/bin/qsub
RMCFG[myresmgr] TIMEOUT=00:05:00

37 moab.cfg
LOGLEVEL 3
LOGFILEMAXSIZE
LOGFILEROLLDEPTH 10
RMPOLLINTERVAL 15
DISABLESCHEDULING TRUE

38 moab.cfg
JOBNODEMATCHPOLICY EXACTNODE
NODEALLOCATIONPOLICY PRIORITY
NODEACCESSPOLICY SINGLEJOB
JOBREJECTPOLICY HOLD
DEFERTIME 00:15:00
DEFERCOUNT 5
JOBACTIONONNODEFAILURE REQUEUE

39 moab.cfg
PROCWEIGHT 10
XFACTORWEIGHT 1000
FSWEIGHT 3
FSUSERWEIGHT 1000
FSPOLICY DEDICATEDPS
FSDEPTH 7
FSINTERVAL 24:00:00
FSDECAY 0.80

40 moab.cfg
RESERVATIONPOLICY CURRENTHIGHEST
RESERVATIONDEPTH 10
BACKFILLPOLICY FIRSTFIT

41 moab.cfg
USERCFG[DEFAULT] FSTARGET=10.0
USERCFG[DEFAULT] MAXIJOBS=16
CLASSCFG[foo] HOSTLIST=node1[0-9]$
CLASSCFG[foo] MAXNODEPERUSER=4
CLASSCFG[foo] MAXJOB[USER]=1
NODECFG[DEFAULT] PRIORITYF=-LOAD

42 Running moab
mdiag -C : will check moab.cfg for errors
/opt/moab/sbin/moab
Startup scripts are in $BUILD_DIR/contrib
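In practice that becomes a short sequence: validate the config, start the daemon, and confirm it answers:
  mdiag -C                  # syntax-check moab.cfg
  /opt/moab/sbin/moab       # start the scheduler
  showq                     # should list the Running/Idle/Blocked queues once moab is up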

43 Troubleshooting
mdiag -R : shows what moab thinks is the status of the resource manager
showq : shows jobs in the Running, Idle, and Blocked moab queues
checkjob -v $JOB_ID
checknode $NODE_ID
showstart $JOB_ID
Logs are in /opt/moab/log

44 Controlling moab
mschedctl -p : pauses moab
mschedctl -r : starts moab
mschedctl -R : re-reads moab.cfg
mschedctl -k : kills moab
mschedctl -L 7 : sets the log level

45 Moab Client
Installed just like on the server
Requires just the following line in moab.cfg:
  SCHEDCFG[mysched] SERVER=mysched:42559
msub, mjobctl : submit and control jobs through Moab instead of the resource manager
ADMINCFG[3] users are allowed to run query commands (checknode, checkjob, etc.)
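A sketch of submitting and controlling a job through Moab from a client host (myjob.pbs and the job identifier are illustrative):
  msub myjob.pbs                                   # returns a Moab job identifier
  msub -l nodes=1:ppn=16,walltime=2:00:00 -q foo myjob.pbs
  checkjob -v <jobid>                              # query the job
  mjobctl -h user <jobid>                          # place a user hold
  mjobctl -c <jobid>                               # cancel the job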

46 Examples

47 External Resources
Moab information, download, and docs:
Torque information, download, docs, and user community lists: -source/torque
