Job Scheduling on a Large UV 1000
Chad Vizino
SGI User Group Conference, May 2011
Overview
About PSC's UV 1000
Simon
UV Distinctives
Operational issues
Conclusion
PSC's UV 1000 - Blacklight
Blacklight after router installation
Lots of cables
Blacklight Hardware
Installed September 2010; routers installed December 2010
2 x 16TB SSIs
128 blades per SSI
8GB per core
2048 physical cores per SSI
Dual-socket 8-way Intel Xeon 2.27GHz (Nehalem)
16 physical cores per blade
32 Hyper-Threaded cores per blade
Current SCRATCH
Lustre 1.8.5
92TB
8 servers
IB SDR connection via Blacklight
2 x DDN 8550
New deployment coming
New SCRATCH
Imminent deployment
Runs drives at 95% of spindle speed
See Michael Levine's talk on Blacklight
User Environment
Login node: dual quad-core Intel Xeon 2.4GHz (Westmere), 24 GB memory
Common /usr/users ($HOME), /usr/local (packages managed with modules)
Login node
Access/edit files
Compile codes
Submit/monitor jobs
Users may not log in to compute nodes (UV SSIs)
Interactive jobs via qsub -I are allowed
Runs Torque server and scheduler processes
Software
SUSE Linux Enterprise Server 11.1
SGI Performance Suite 1.1
Torque Resource Manager 2.3.13 (with local mods)
Simon scheduler (locally developed)
About Simon
Locally developed job scheduler; work started 10 years ago
Integrated with Torque
Ported to various architectures:
Compaq AlphaServer SC (RMS)
Cray XT3 (CPA)
SGI Altix 4700 (cpusets)
UV Distinctive #1: Cpusets
Job assigned to whole blades
Users request ncpus and walltime limits
Get more memory by requesting more blades
Memory enforcement: job killed when cpuset memory_pressure > 0
Cpuset is cpu exclusive
Cpuset is mem exclusive
Lessons learned from Altix 4700 experience
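Since jobs get whole blades and extra memory is obtained by requesting more blades, the allocation arithmetic can be sketched as below. This is a hypothetical illustration (Simon itself is written in Tcl), using the figures from these slides: 16 physical cores and 8GB per core, i.e. 128GB per blade.

```python
# Hypothetical sketch of whole-blade allocation arithmetic on Blacklight.
# Not Simon's actual code; constants come from the hardware slides.
import math

CORES_PER_BLADE = 16
GB_PER_CORE = 8
GB_PER_BLADE = CORES_PER_BLADE * GB_PER_CORE  # 128 GB per blade

def blades_needed(ncpus, mem_gb=0):
    """Blades required to satisfy a core request and (optionally) a
    memory request; jobs are always assigned whole blades."""
    by_cpu = math.ceil(ncpus / CORES_PER_BLADE)
    by_mem = math.ceil(mem_gb / GB_PER_BLADE) if mem_gb else 0
    return max(by_cpu, by_mem, 1)

# A 16-core job that needs 512 GB must occupy 4 blades, not 1.
print(blades_needed(16))       # 1
print(blades_needed(16, 512))  # 4
```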
Cpusets facilitate repeatable performance
Hard to achieve repeatable performance!
More on Simon
Written in Tcl; about 4,200 lines of code
Integration with Torque
Backfill
Reservations
Stuffing control (QOS)
Co-scheduling software licenses
Flexible walltime support
Torque Integration Features
Linux kernel job integration: mom calls job_create() with Torque job id
Enables use of ja by users: csacom -j `printf %x <torque_job_id>`
Limiting process threads (e.g. Java garbage collection threads, -XX:ParallelGCThreads=N)
thread_factor set on queue; limit = thread_factor*ncpus
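Two small pieces of arithmetic from the slide above, sketched in Python for illustration (the values are hypothetical, not from a real job):

```python
# 1) csacom takes the kernel job id in hex, hence the `printf %x`
#    in the shell example; the same conversion in Python:
torque_job_id = 123456                   # hypothetical Torque job id
csacom_cmd = f"csacom -j {torque_job_id:x}"
print(csacom_cmd)                        # csacom -j 1e240

# 2) Per-queue thread limit: limit = thread_factor * ncpus.
thread_factor = 4                        # hypothetical queue setting
ncpus = 16                               # one blade's physical cores
limit = thread_factor * ncpus
print(limit)                             # 64
```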
Distinctive #2: Dealing with Hyper-Threads
Hyper-Threads and Jobs
Users specify physical core count: qsub -l ncpus=N
N must be a multiple of 16
PBS_NCPUS (N), PBS_HT_NCPUS (N*2)
mpirun -np $PBS_NCPUS, or mpirun -np $PBS_HT_NCPUS
Mom daemon creates cpuset with Hyper-Thread cpu count (N*2)
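The ncpus rules above can be sketched as a small validation step (an illustrative helper, not part of Torque or Simon):

```python
# Sketch of the Hyper-Thread request rules: users ask for physical
# cores in multiples of 16 (whole blades); the cpuset is built with
# twice as many logical cpus.
def ht_env(ncpus):
    if ncpus % 16 != 0:
        raise ValueError("ncpus must be a multiple of 16 (whole blades)")
    return {"PBS_NCPUS": ncpus, "PBS_HT_NCPUS": ncpus * 2}

env = ht_env(32)
print(env)  # {'PBS_NCPUS': 32, 'PBS_HT_NCPUS': 64}
```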
CPU Numbering from topology output
CPU  Blade       PhysID CoreID APIC-ID Family Model Speed L1(KiB)  L2(KiB) L3(KiB)
--------------------------------------------------------------------------------
0    r001i01b00  00     00     0       6      46    2267  32d/32i  256     24576
1    r001i01b00  00     01     2       6      46    2267  32d/32i  256     24576
2    r001i01b00  00     02     4       6      46    2267  32d/32i  256     24576
3    r001i01b00  00     03     6       6      46    2267  32d/32i  256     24576
...
13   r001i01b00  01     09     50      6      46    2267  32d/32i  256     24576
14   r001i01b00  01     10     52      6      46    2267  32d/32i  256     24576
15   r001i01b00  01     11     54      6      46    2267  32d/32i  256     24576
...
2048 r001i01b00  00     00     1       6      46    2267  32d/32i  256     24576
2049 r001i01b00  00     01     3       6      46    2267  32d/32i  256     24576
2050 r001i01b00  00     02     5       6      46    2267  32d/32i  256     24576
2051 r001i01b00  00     03     7       6      46    2267  32d/32i  256     24576
2052 r001i01b00  00     08     17      6      46    2267  32d/32i  256     24576
...
2060 r001i01b00  01     08     49      6      46    2267  32d/32i  256     24576
2061 r001i01b00  01     09     51      6      46    2267  32d/32i  256     24576
2062 r001i01b00  01     10     53      6      46    2267  32d/32i  256     24576
2063 r001i01b00  01     11     55      6      46    2267  32d/32i  256     24576
Blade scheduling
System Hierarchy and Scheduling
Rack > IRU > Blade > Memnode > Cpus
Boot blade (1st blade of each SSI) not scheduled
IO blades (have IB cards) not scheduled
Simon maintains list of free and in-use memory nodes per SSI
Simon manipulates nodeset resource
Nodeset resource used for job placement
Simon places jobs using nodeset
Format is mems:cpus, e.g. nodeset=2-3:16-31,2064-2079
Used by pbs_mom to construct cpuset on Blacklight node
Queues can have a memnode mask to target specific memnodes (blades)
Debug jobs on blade 127 (1/2 memory)
Also on other nodes with < 128GB (full memory)
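The nodeset string for one blade can be derived from the numbering visible in the topology output: blade b owns memnodes 2b and 2b+1, physical cpus 16b..16b+15, and Hyper-Thread siblings offset by 2048. A hypothetical sketch of that construction (not Simon's actual code):

```python
# Build a mems:cpus nodeset string for a single blade, assuming the
# numbering shown in the topology listing (2 memnodes and 16 physical
# cpus per blade; HT sibling cpu ids are physical id + 2048).
HT_OFFSET = 2048

def blade_nodeset(b):
    mems = f"{2*b}-{2*b + 1}"
    phys = f"{16*b}-{16*b + 15}"
    ht = f"{16*b + HT_OFFSET}-{16*b + 15 + HT_OFFSET}"
    return f"{mems}:{phys},{ht}"

# Blade 1 yields the example string from the slide above.
print(blade_nodeset(1))  # 2-3:16-31,2064-2079
```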
PMM: a text-based monitor
    1 (bl0)                             2 (bl1)                          =partition
    1        2        3        4        5        6        7        8     =RACK
IRU -------- -------- -------- -------- -------- -------- -------- --------
    ******** ******** ******** *******. ******** ******** ******** *******.
    ******** ******** ******** ******** ******** ******** ******** ********
 23 -------- -------- -------- -------- -------- -------- -------- --------
    ******** ******** ******** ******** ******** ******** ******** ********
    B*xxx*** ******** ******** ***.**** B*xxx*** ******** ******** ********
 01 -------- -------- -------- -------- -------- -------- -------- --------
    4567CDEF=HEX BLADE #   Key: *=allocated  B=boot
    012389AB               .=free  x=not scheduled
Blacklight Racks
Blacklight IRUs
Blacklight 3D Monitor See Blacklight3DMonitor.avi
UV Distinctive #3: Lots of Hardware
Database Holds Static Configuration Data
SQLite SQL database engine
Provides one place to get SSI configuration information for both SSIs
Easy access to topology command output for each SSI
Integration with Simon planned
Used by pmm and Blacklight 3D Monitor
Database Tables (all in under 500 kilobytes!)
Partitions, Blades, Cpus, Cpusets, Memnodes, Devices, Routers
Partitions Table
sqlite> select * from partitions limit 1;
partition_num = 1
serial = UV-00000071
hostname = bl0.psc.teragrid.org
blades = 128
routers = 96
cpus = 4096
mem_total_gb = 16060.64
io_risers = 5
infiniband_controllers = 6
network_controllers = 2
scsi_controllers = 1
usb_controllers = 8
vga_gpus = 1
Blades Table
sqlite> select * from blades limit 1;
blade_num = 0
partition_num = 1
blade_name = r001i01b00
rack = 1
iru = 1
blade = 0
asic = UVHub 2.0
nasid = 0
cpus = 32
memory_kb = 132077200
configured = 0
comment = boot
Cpusets and Memnodes Tables
sqlite> select * from cpusets limit 1;
cpuset_num = 0
partition_num = 1
cpuset_name = boot
mems = 0-1
cpus = 0-15,2048-2063

sqlite> select * from memnodes limit 1;
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
mem_total_kb = 64968336
Memnodes and Cpus Tables
sqlite> select * from memnodes limit 1;
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
mem_total_kb = 64968336

sqlite> select * from cpus limit 1;
cpu_num = 0
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
physid = 0
coreid = 0
apic_id = 0
family = 6
model = 46
speed = 2266
l1 = 32d/32i
l2 = 256
l3 = 24576
Devices and Routers Tables
sqlite> select * from devices limit 1;
blade_num = 0
partition_num = 1
pci_address = 0000:01:00.0
x_server_display = -
device = Intel 82576 Gigabit Network Connection

sqlite> select * from routers limit 1;
router_num = 0
partition_num = 1
router_name = r001i01r00
rack = 1
upos = 1
router = 0
class = NL5Router
Database Queries
Facilitates blade name and cpu/memnode translation
Look up last job use by blade
Helps answer:
What blades did a job use?
What memnodes and partition correspond to a given blade name?
Which blades have less memory than expected after boot?
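The last question above is the kind of one-line lookup the database enables. A sketch using Python's sqlite3 against a toy copy of the blades table (the schema columns come from the slides; the row data here is illustrative):

```python
# Find blades reporting less memory than expected after boot, using a
# toy in-memory copy of the blades table. Data is illustrative.
import sqlite3

EXPECTED_KB = 132077200  # full-memory blade, per the blades table example

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE blades
    (blade_num INT, partition_num INT, blade_name TEXT, memory_kb INT)""")
con.executemany("INSERT INTO blades VALUES (?,?,?,?)",
    [(0, 1, "r001i01b00", 132077200),
     (1, 1, "r001i01b01", 132077200),
     (2, 1, "r001i01b02", 66038600)])   # hypothetical failed DIMM

low = con.execute(
    "SELECT blade_name FROM blades WHERE memory_kb < ?",
    (EXPECTED_KB,)).fetchall()
print(low)  # [('r001i01b02',)]
```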
Operations: Pre-job scan (prologue script)
Cpuset coherency at startup
Tmpfs: RAM-based file system backed by the cpuset's memory
/dev/tmpfs/<job_id> directory created at job start, destroyed at job end (also scan for orphans in prologue)
Lustre check
Save job script for future reference
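The orphan scan mentioned above amounts to comparing the directories under /dev/tmpfs against the jobs still known to be active. A minimal, hypothetical sketch of that set difference (the actual prologue is a script; names here are illustrative):

```python
# Sketch of the prologue's orphan scan: any /dev/tmpfs/<job_id>
# directory whose job is no longer active should have been destroyed
# at job end, so it is flagged for cleanup.
def find_orphans(tmpfs_dirs, active_jobs):
    """tmpfs_dirs: directory names under /dev/tmpfs;
    active_jobs: job ids of jobs still running."""
    return sorted(set(tmpfs_dirs) - set(active_jobs))

print(find_orphans(["101.bl0", "102.bl0", "99.bl0"],
                   {"101.bl0", "102.bl0"}))  # ['99.bl0']
```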
Operations: Memory Failures
Check at boot time via topology command difference checker
Watch memlog via Simple Event Correlator (SEC)
SEC updates system db so we can keep track of failures
Provides a placeholder so we don't forget about them
Remove from db after hardware replaced
Future Plans
Develop database integration
Predictive walltime scheduling to mitigate long drain times
D. Tsafrir, Y. Etsion, and D. G. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE Trans. Parallel & Distributed Syst. 18(6), pp. 789-803, Jun 2007.
Topology-aware scheduling algorithms
More PSC! Michael Levine is giving the customer keynote on Blacklight, Wednesday at 9:00am.