SUPERCOMPUTING 2001: SUN GRID ENGINE
Grid Software Stack
- Global Grid: Avaki, Cactus, Globus, PUNCH
- Campus Grid: SGE Broker, SGE Enterprise Edition, Technical Compute Portal
- Distributed Resource Management: Sun Grid Engine
- Distributed Process Management: ClusterTools
- Infrastructure Administration: iPlanet, JXTA
- Development Tools: Forte
- Systems Administration: SunMC, ichange, Solaris Management Console
- Platform: Solaris
- Storage Management: JIRO, QFS, SAM-FS, HPC-SAN
Sun Grid Engine: Useful Links
- http://www.sun.com/gridware: product overview, FAQs, web-based training
- http://supportforum.sun.com/gridengine: support forum on installation, compute farms/HPC, application notes, FAQs
- http://gridengine.sunsource.net: open source project, mailing lists, courtesy binaries
Sun Grid Engine Focus: Large, long-running batch jobs
(chart: number of users, from 10 to 10 million, versus job duration, from milliseconds to days; transaction processing sits in the many-users/short-duration region, batch processing in the few-users/long-duration region, with Sun Grid Engine targeting batch processing)
Overview
- Users submit jobs to Sun Grid Engine and receive results back
- Resource selection is based on:
  - the resource allocation policy
  - job and resource priority
  - the systems' characteristics and current state
- Jobs are then dispatched to the compute resources (TCF, WS, CS)
Design Approach: Bank Analogy
- Teller counters (Loans, Mortgages, Loans) = Queues
- Customers (of lower or higher priority) = Jobs
- Waiting area = Pending list
Simple Installation: INSTALLATION OPTIONS
- Master host: holds $CODINE_ROOT/ with the default/ cell directory (and its subdirectories)
- Exec hosts (one or more)
- Submit host
- Admin host
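As a rough illustration of how such a cell is brought up, here is a hedged sketch using the installation scripts shipped with Grid Engine 5.x distributions; the exact script names and prompts depend on the release, and the shared-directory layout is an assumption:

  # on the master host, from the unpacked distribution directory ($CODINE_ROOT):
  cd $CODINE_ROOT
  ./install_qmaster        # creates the default/ cell and starts the master-side daemons

  # on each execution host (with $CODINE_ROOT shared or copied over):
  cd $CODINE_ROOT
  ./install_execd          # registers the host with the master and starts execd/commd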
High Availability Installation: INSTALLATION OPTIONS
- HA file server: holds $CODINE_ROOT/ with the default/ cell directory (and other directories)
- Master host plus a shadow master host
- Exec host(s)
- Submit host
- Admin host
Architecture
- Compute hosts (Solaris, Linux, etc.): run execd and commd
- Master host: runs qmaster, schedd, and commd
Host Types and Daemons
- Master host: masterd (qmaster), schedd, and commd are mandatory
- Exec host: execd and commd are mandatory
- shadowd is optional (runs on a shadow master host)
- Submit host and admin host: no daemons required
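Host roles are administered with qconf; a short sketch (the host names are invented):

  qconf -ah adminbox       # register an administrative host
  qconf -as login1         # register a submit host
  qconf -sel               # list the registered execution hosts
  qconf -sh                # list the administrative hosts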
Queues
- Queue: job container; jobs acquire attributes of the queue, e.g.:
  - CPU and memory limits
  - priority
  - suspension (scheduled or manual)
  - auxiliary scripts
  - type: interactive, batch, parallel
- Job slots: allow multiple jobs with the same characteristics on the same host
(a sample queue configuration is sketched below)
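To show how these attributes appear in practice, here is a hedged sketch of inspecting a queue with qconf; the queue name bigmem.q and all values are hypothetical, and the full attribute list is longer and varies between releases:

  qconf -sq bigmem.q       # show the queue configuration (abbreviated output)
    qname        bigmem.q
    qtype        BATCH INTERACTIVE   # batch and interactive jobs allowed
    slots        4                   # up to 4 jobs at a time in this queue
    priority     0                   # nice value applied to jobs
    h_rt         24:00:00            # hard run-time limit (example value)
    h_vmem       2G                  # hard memory limit (example value)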
Information Flow
1) Submit: qsub sends the job to qmaster
2) qmaster registers the job
3) schedd selects the job and a matching queue
4) Dispatch: qmaster sends the job to an execd
5) Load report: the execds periodically report load to qmaster
6) Control: the execd starts and controls the job
7) Inform when done: the execd notifies qmaster when the job finishes
8) Record: accounting information is written
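A minimal end-to-end view of that flow from the command line; the script name is a placeholder:

  qsub sleep.sh            # 1) submit: the job goes to qmaster
  qstat                    # watch the job while it is scheduled, dispatched, and run
  qacct -j <job_id>        # 8) after completion, read the recorded accounting data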
Scheduling
- Job selection (schedd): user sort; first come, first served
- Queue selection (schedd, in cooperation with qmaster): resource match, sequence number, load formula
- Load formula (configurable by the administrator), e.g. load_avg, slots, free_mem
- Example job script requesting a resource (a license):
    #
    # STARHPC
    # Version 1
    #
    #$ -l cdlic=1
    cp xyz abc
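The load formula and the queue sort policy live in the scheduler configuration; a hedged sketch of inspecting and changing them (the attribute names come from sched_conf(5), the values shown are only examples):

  qconf -ssconf            # show the scheduler configuration (abbreviated)
    load_formula        np_load_avg   # example: rank queues by normalized load average
    queue_sort_method   load          # or "seqno" to rank by sequence number
  qconf -msconf            # open the scheduler configuration in an editor to change it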
Job Execution Event Chain
- Normal flow: START -> queue/host prolog -> parallel start -> job script runs -> parallel stop -> queue/host epilog -> END
- SUSPEND: suspend method; the resume method continues the job
- MIGRATE: checkpoint command, migration command, requeue the job
- DELETE: terminate method, clean command
- EXIT: the job ends and the chain completes normally
- The checkpoint command may also run at specified points while the job script runs
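Prolog and epilog hooks are plain scripts configured per queue or host; a hedged sketch of a trivial epilog (the script path, log file, and queue name are invented for illustration):

  #!/bin/sh
  # hypothetical queue epilog: log when each job finishes
  echo "`date`: job $JOB_ID finished on `hostname`" >> /var/tmp/sge_epilog.log

  # attach it via the queue configuration, for example:
  #   qconf -mq myqueue     and set:   epilog /usr/local/sge/scripts/epilog.sh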
Complexes: OVERVIEW
- Complex: a related group of resources
- Examples:
  - Host complex: num_proc, load_avg, mem_total, swap_free
  - Queue complex: mem limit, tmp dir, slots, suspend schedule
  - Global complex: floating licenses
  - User-defined complexes
- Resource types (examples): consumable (swap_free) or fixed (num_proc); requestable (license, mem limit)
- Example request: qsub -l job_type=synth,license=1,mem_limit=1g
Example: Host Complex
Complexes: Resource Attribute DEFINITIONS
- name: variable name; must be unique across all complexes
- shortcut: unique abbreviation; may be used as a replacement everywhere
- type: STRING, INT, BOOL, MEMORY, DOUBLE, TIME, CSTRING, HOST
- value: default value; may be overridden by an actual value or a load report
- relop: relational operator: ==, <, <=, >, >=, !=
- request: if YES, the user can request the attribute; if FORCED, the user MUST request it
- consum: whether the attribute is consumable
- default: default request if the resource is marked consumable
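A hedged sketch of how a single attribute definition looks with these columns, reusing the cdlic license resource from the scheduling example (the values are purely illustrative; depending on the release, complexes are shown with qconf -sc and modified with qconf -mc, possibly taking a complex name as argument):

  # name     shortcut  type  value  relop  requestable  consumable  default
  cdlic      cdl       INT   0      <=     YES          YES         0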
Load Sensors: tracking arbitrary resources
- Report host-specific and cluster-wide information. Examples:
  - "free disk space on an arbitrary partition" (host-specific)
  - "number of licenses of a particular SW in use" (cluster-wide)
- A custom script or program on each compute host is queried periodically
- The information is used for resource requests, load thresholds, and the load formula
Resource Matching: OPERATION OF COMPLEXES
Can a job run in a particular queue? if (user-request RELOP actual-value) then YES
Example 1: qsub -l arch=solaris64 myjob.sh
  actual arch=glinux; RELOP for arch is ==
  if (solaris64 == glinux) then schedule job. Result: the job does NOT get scheduled
Example 2: qsub -l h_vmem=256m myjob.sh
  actual h_vmem=512m; RELOP for h_vmem is <=
  if (256m <= 512m) then schedule job. Result: the job gets scheduled
Inheritance of Resources
- global resA=val1: the value applies everywhere
- host resB=val2: the value applies on all queues on this host and overrides the global value
- queue resC=val3: the value applies to all jobs on this queue
- user-defined resD=val4: the value applies wherever the complex is attached
Consumables: SPECIAL TYPE OF RESOURCE
- Capacity management for limited resources
- Can use a user-defined resource, a built-in load value, or a value from a load sensor:
  - "free memory"
  - "amount of space in a scratch directory"
  - "number of licenses"
- Consumption of resources is determined in two ways:
  - If REQUESTABLE is YES or FORCED, individual jobs "consume" the specified amount of the resource (either the amount requested or the DEFAULT amount)
  - If REQUESTABLE is NO, the amount "consumed" must be given by a built-in load value or a load sensor
- Jobs requesting unavailable resources wait until the resources are freed
(a setup sketch follows below)
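A hedged sketch of declaring a consumable and giving it a capacity; the resource name scratch and the capacity are invented, and the exact syntax for editing complexes differs slightly between releases:

  # 1) add a consumable attribute to the complex (edited via qconf -mc):
  # name     shortcut  type    value  relop  requestable  consumable  default
  scratch    scr       MEMORY  0      <=     YES          YES         0

  # 2) give it a capacity on an execution host (or on the global host) via complex_values:
  qconf -me somehost       # set:  complex_values scratch=100G

  # 3) jobs then draw from that capacity and queue up when it is exhausted:
  qsub -l scratch=10G myjob.sh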
Example: Consumable
(screenshots: initial global and queue complex values; jobs requesting resources; updated global and queue complex values)
Parallel and Checkpointing Environments: OVERVIEW
- Environment: a set of queues that is used to support parallel or checkpointing jobs
- (diagram: Env A contains queues q1, q2, q4, q6; another environment contains q3, q5, q7)
PE Configuration (qmon)
Parallel Environment: codine_pe(5)
- View with: qconf -sp <pe name>
- Submit with: qsub -pe <pe name> <slot range>
- allocation rule settings:
  - <integer>: allocate exactly this many slots per host
  - $pe_slots: allocate as many slots on a single host as stated on the command line
  - $fill_up: fill up one host, move to another, continue until the range is filled
  - $round_robin: do round-robin allocation over all suitable hosts until the range is filled
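A hedged sketch of what a parallel environment definition looks like via qconf -sp; the PE name mpi_pe and the start/stop scripts are examples, and the exact set of fields depends on the Grid Engine release:

  qconf -sp mpi_pe
    pe_name           mpi_pe
    slots             64                          # total slots the PE may occupy
    start_proc_args   /usr/local/sge/mpi/startmpi.sh $pe_hostfile   # example start script
    stop_proc_args    /usr/local/sge/mpi/stopmpi.sh                 # example stop script
    allocation_rule   $round_robin                # spread slave tasks over suitable hosts

  qsub -pe mpi_pe 4-8 mpijob.sh                   # request between 4 and 8 slots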
Checkpoint Configuration: see checkpoint(5); view with qconf -sckpt <ckpt name>
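A hedged sketch of a checkpointing environment as shown by qconf -sckpt; the name and all commands are placeholders for a site-specific checkpointing package, and the field set is abbreviated:

  qconf -sckpt userckpt
    ckpt_name         userckpt
    interface         userdefined                        # user-level checkpointing
    ckpt_command      /usr/local/ckpt/take_ckpt.sh       # example command
    migr_command      /usr/local/ckpt/migrate.sh         # example command
    clean_command     /usr/local/ckpt/cleanup.sh         # example command
    ckpt_dir          /scratch/ckpt                      # where checkpoint files go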
Load Sensor: Free disk space (a script or binary that loops until told to quit and outputs information in a specific format)

  #!/bin/sh
  # host-specific: report free space on /var for this host
  myhost=`$codine_root/utilbin/solaris64/gethostname -name`

  end=false
  while [ $end = false ]; do
      # wait for a request from execd; quit on EOF or "quit"
      read input
      result=$?
      if [ $result != 0 ]; then
          end=true
          break
      fi
      if [ "$input" = "quit" ]; then
          end=true
          break
      fi
      # output the load value in the required begin/.../end format
      dfoutput=`df -k /var | tail -1`
      varfree=`echo $dfoutput | awk '{ print $4 }'`
      echo "begin"
      echo "$myhost:varfree:${varfree}k"
      echo "end"
  done
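For the sensor to be useful, the reported value has to exist as a complex attribute and the script has to be registered in the host (or global) configuration; a hedged sketch, with the script path and the attribute definition as assumptions:

  # make the value known as a complex attribute, e.g.:
  # name     shortcut  type    value  relop  requestable  consumable  default
  varfree    vf        MEMORY  0      <=     YES          NO          0

  # point the execution host's configuration at the sensor script:
  qconf -mconf somehost    # set:  load_sensor /usr/local/sge/sensors/varfree.sh

  # jobs can then request the resource:
  qsub -l varfree=100m myjob.sh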
Consumables: License management
Problem: for a particular application there are four floating licenses: two for long jobs, two for short jobs.
Requirements:
- The application should only run on two particular hosts out of the whole cluster.
- Jobs should only run if a license is available.
- At any one time there can only be four instances of the job running: two short jobs and two long ones.
How would you manage this?
Consumables: License management
- Create global consumables: long=2, short=2
- Set them to zero explicitly wherever they are not desired; jobs naturally go to whatever queues remain
- Host1 and Host2: queuea with long=0 (short jobs run here), queueb with short=0 (long jobs run here)
- All other hosts: long=0, short=0 (neither runs here)
(the corresponding qconf steps are sketched below)
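A hedged sketch of those steps; the host and queue names follow the slide, the shortcuts are invented, and the complex-editing syntax may differ between releases:

  # 1) define two global consumables (via qconf -mc), e.g.:
  # name   shortcut  type  value  relop  requestable  consumable  default
  long     lg        INT   0      <=     YES          YES         0
  short    st        INT   0      <=     YES          YES         0

  # 2) set the global capacity to the number of licenses:
  qconf -me global         # set:  complex_values long=2,short=2

  # 3) zero the consumables out where the jobs must not run:
  qconf -mq queuea         # on Host1/Host2:  complex_values long=0
  qconf -mq queueb         # on Host1/Host2:  complex_values short=0
  #   and on the queues of all other hosts:   complex_values long=0,short=0

  # 4) submit against the licenses:
  qsub -l long=1  longjob.sh
  qsub -l short=1 shortjob.sh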
Queues: Resource sharing
Problem: set the base configuration to match these needs.
Hardware: 24 x Sun Fire 280R, 2 CPUs and 1 GB RAM each.
Requirements:
- interactive + fast jobs: no more than 16 at a time, runtime <= 1 hr, mem <= 256 MB, up to 2 jobs per CPU
- long jobs: no time limit, mem <= 512 MB, only one job per CPU
- suspend long jobs, if needed, to run short jobs, but try to avoid suspending whenever possible
- one job per user, unless resources are idle
Queues: Resource sharing
Complexes: add a new queue complex attribute:
  resource name: jobtype, type: string, value: NONE, relop: ==, requestable: YES, consumable: NO, default: long
Queues: on each host, set up two queues, long and short:
- long: batch-only queue; 2 slots; set h_vmem to 512M; set jobtype=long
- short: batch + interactive queue; 4 slots; make the other queue subordinate; set h_rt to 3600 seconds and h_vmem to 256M; set jobtype=short
- on all machines, number the queues in opposite ascending order, e.g.:
    queue        hosta  hostb  hostc  hostd
    batch only   1      2      3      4
    batch+int    4      3      2      1
Cluster:
- set user-sort
- set queue_sort_method to seqno
- disable batch+int queues on all but 4 systems (enable them on others, e.g., if a system goes down)
Usage:
- to run a short job: qsub -l jobtype=short shortjob.sh
- to run an interactive job: qrsh /usr/local/bin/myjob
- to run a long job: qsub -l jobtype=long longjob.sh or qsub longjob.sh
(a qconf sketch of this setup follows below)
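A hedged sketch of the queue setup for one host; the queue names follow the slide's table, the attribute names come from queue_conf(5), and the per-host queue naming is only a convention that may differ between releases:

  # long queue on hosta (batch only, 2 slots, 512M memory limit, jobtype=long):
  qconf -aq long.hosta
  #   qtype            BATCH
  #   slots            2
  #   seq_no           1
  #   h_vmem           512M
  #   complex_values   jobtype=long

  # short queue on hosta (batch + interactive, 4 slots, 1h runtime, long queue subordinate):
  qconf -aq short.hosta
  #   qtype            BATCH INTERACTIVE
  #   slots            4
  #   seq_no           4
  #   h_rt             1:00:00
  #   h_vmem           256M
  #   subordinate_list long.hosta
  #   complex_values   jobtype=short

  # have the scheduler rank queues by sequence number:
  qconf -msconf            # set:  queue_sort_method seqno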