Job Scheduling on a Large UV 1000. Chad Vizino, SGI User Group Conference, May 2011. Pittsburgh Supercomputing Center

1 Job Scheduling on a Large UV 1000 Chad Vizino SGI User Group Conference May 2011

2 Overview
- About PSC's UV 1000
- Simon
- UV Distinctives
- UV Operational issues
- Conclusion

3 PSC's UV Blacklight

4 Blacklight after router installation

5 Lots of cables

6 Blacklight Hardware
- Installed September 2010
- Routers installed December 2010
- 2 x 16TB SSIs
- 128 blades per SSI
- 8GB per core
- 2048 physical cores per SSI
- Dual socket 8-way Intel Xeon 2.27GHz (Nehalem)
- 16 physical cores per blade
- 32 Hyper-Threaded cores per blade

7 Current SCRATCH
- Lustre TB
- 8 Servers
- IB SDR connection via Blacklight
- 2 x DDN 8550
- New deployment coming

8 New SCRATCH
- Imminent deployment
- Runs drives at 95% of spindle speed
- See Michael Levine's talk on Blacklight

9 User Environment
- Login node
  - Dual quad-core Intel Xeon 2.4GHz (Westmere)
  - 24 GB memory
- Common /usr/users ($HOME), /usr/local (packages managed with modules)

10 Login node
- Access/edit files
- Compile codes
- Submit/monitor jobs
- Users may not log in to compute nodes (UV SSIs)
- Interactive jobs via qsub -I are allowed
- Runs Torque server and scheduler processes

11 Software
- SUSE Linux Enterprise Server 11.1
- SGI Performance Suite 1.1
- Torque Resource Manager (with local mods)
- Simon scheduler (locally developed)

12 About Simon
- Locally developed job scheduler
- Work started 10 years ago
- Integrated with Torque
- Ported to various architectures
  - Compaq AlphaServer SC (RMS)
  - Cray XT3 (CPA)
  - SGI Altix 4700 (cpusets)

13 UV Distinctive #1: Cpusets
- Job assigned to whole blades
- Users request ncpus and walltime limits
- Get more memory by requesting more blades
- Memory enforcement
  - Job killed when cpuset memory_pressure > 0
- Cpuset is cpu exclusive
- Cpuset is mem exclusive
- Lessons learned from Altix 4700 experience
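
The memory enforcement rule on this slide (kill the job when the cpuset's memory_pressure rises above 0) can be sketched as a small check. This is only an illustration, not the actual Torque/Simon code: the function names are hypothetical, and it assumes the cpuset is visible as a directory containing a `memory_pressure` file, as in the Linux cpuset pseudo-filesystem. The demo uses a temporary directory in place of a real cpuset.

```python
import tempfile
from pathlib import Path


def read_memory_pressure(cpuset_dir):
    """Return the integer stored in <cpuset_dir>/memory_pressure."""
    return int(Path(cpuset_dir, "memory_pressure").read_text().strip())


def should_kill(cpuset_dir):
    """Apply the slide's rule: any memory pressure above 0 means kill the job."""
    return read_memory_pressure(cpuset_dir) > 0


# Demo against a stand-in directory (hypothetical; a real cpuset would live
# under the mounted cpuset filesystem).
with tempfile.TemporaryDirectory() as fake_cpuset:
    pressure_file = Path(fake_cpuset, "memory_pressure")
    pressure_file.write_text("0\n")
    quiet = should_kill(fake_cpuset)
    pressure_file.write_text("37\n")
    pressured = should_kill(fake_cpuset)

print(quiet, pressured)
```

Because the cpuset is memory exclusive, a nonzero value indicates the job is outgrowing the blades it asked for, which is why the remedy on the slide is to request more blades.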

14 Cpusets facilitate repeatable performance

15 Hard to achieve repeatable performance!

16 More on Simon
- Written in TCL
- About 4,200 lines of code
- Integration with Torque
- Backfill
- Reservations
- Stuffing control (QOS)
- Co-scheduling software licenses
- Flexible walltime support

17 Torque Integration Features
- Linux kernel job integration
  - Mom calls job_create() with Torque job id
  - Enables use of ja by users
  - csacom -j `printf %x <torque_job_id>`
- Limiting process threads
  - Java garbage collection threads: -XX:ParallelGCThreads=N
  - thread_factor set on queue
  - Limit = thread_factor*ncpus
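
The two small calculations on this slide can be written out as follows. This is a sketch under assumptions: the function names are hypothetical, and it assumes the Torque job id has the usual `<number>.<server>` form, so only the leading number is converted to hex (the same value that `printf %x` would produce for it).

```python
def thread_limit(thread_factor, ncpus):
    """Per-job thread limit as stated on the slide: thread_factor * ncpus."""
    return thread_factor * ncpus


def job_id_hex(torque_job_id):
    """Hex form of the numeric part of a Torque job id (assumed '<number>.<server>')."""
    numeric_part = str(torque_job_id).split(".", 1)[0]
    return format(int(numeric_part), "x")


print(thread_limit(2, 32))            # queue with thread_factor 2, job with 32 cpus
print(job_id_hex("12345.example"))    # 'example' is a made-up server name
```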

18 Distinctive #2: Dealing with Hyper-Threads

19 Hyper-Threads and Jobs
- Users specify physical core count
  - qsub -l ncpus=N
  - N must be a multiple of 16
- PBS_NCPUS (N)
- PBS_HT_NCPUS (N*2)
- mpirun -np $PBS_NCPUS
- Or, mpirun -np $PBS_HT_NCPUS
- Mom daemon creates cpuset with Hyper-Thread cpu count (N*2)
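
The arithmetic behind an ncpus request can be sketched as below. The constants come from the hardware slide (16 physical cores per blade, 8GB per core); the function name and the returned dictionary layout are hypothetical, and the memory figure is nominal (some blades, such as the debug blade, have less).

```python
CORES_PER_BLADE = 16   # physical cores per blade
GB_PER_CORE = 8        # nominal memory per physical core


def job_environment(ncpus):
    """Derive the values a job with 'qsub -l ncpus=N' would see."""
    if ncpus <= 0 or ncpus % CORES_PER_BLADE != 0:
        raise ValueError("ncpus must be a positive multiple of 16")
    return {
        "PBS_NCPUS": ncpus,                    # physical cores (N)
        "PBS_HT_NCPUS": ncpus * 2,             # Hyper-Thread cpus (N*2)
        "blades": ncpus // CORES_PER_BLADE,    # whole blades assigned
        "memory_gb": ncpus * GB_PER_CORE,      # nominal memory in the cpuset
    }


print(job_environment(32))
```

A 32-core request therefore occupies two whole blades, and asking for more memory means asking for more blades even if the extra cores stay idle.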

20 CPU Numbering from topology output
  CPU Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB)
  0 r001i01b d/32i
  1 r001i01b d/32i
  2 r001i01b d/32i
  3 r001i01b d/32i
  ...
  13 r001i01b d/32i
  14 r001i01b d/32i
  15 r001i01b d/32i
  ...
  2048 r001i01b d/32i
  2049 r001i01b d/32i
  2050 r001i01b d/32i
  2051 r001i01b d/32i
  2052 r001i01b d/32i
  ...
  2060 r001i01b d/32i
  2061 r001i01b d/32i
  2062 r001i01b d/32i
  2063 r001i01b d/32i

21 Blade scheduling

22 System Hierarchy and Scheduling
- Rack / IRU / Blade / Memnode / Cpus
- Boot blade (1st blade of each SSI) not scheduled
- IO blades (have IB cards) not scheduled
- Simon maintains list of free and in-use memory nodes per SSI
- Simon manipulates nodeset resource

23 Nodeset resource used for job placement
- Simon places jobs using nodeset
  - Mems:cpus
  - nodeset=2-3:16-31,
- Used by pbs_mom to construct cpuset on Blacklight node
- Queues can have a memnode mask
  - Target specific memnodes (blades)
  - Debug jobs on blade 127 (1/2 memory)
  - Also on other nodes with < 128GB (full memory)
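
A mems:cpus string for one blade can be built as in the sketch below. Several things here are assumptions rather than facts from the slides: two memnodes per blade (suggested by the boot cpuset's mems = 0-1), Hyper-Thread sibling cpus numbered 2048 higher than the physical cores (suggested by the topology listing, where cpus 0-15 and 2048-2063 appear together), and the exact string layout after the comma, which is cut off on the slide. The function names are hypothetical.

```python
CORES_PER_BLADE = 16
MEMNODES_PER_BLADE = 2            # assumption: one memnode per socket
PHYSICAL_CORES_PER_SSI = 2048     # assumption: HT sibling = physical cpu + 2048
BLADES_PER_SSI = PHYSICAL_CORES_PER_SSI // CORES_PER_BLADE


def _span(first, count):
    """Format an inclusive range such as '16-31'."""
    return f"{first}-{first + count - 1}"


def nodeset_for_blade(blade):
    """Return an illustrative 'mems:cpus' string for a single blade."""
    if not 0 <= blade < BLADES_PER_SSI:
        raise ValueError("blade number out of range for one SSI")
    mems = _span(blade * MEMNODES_PER_BLADE, MEMNODES_PER_BLADE)
    first_cpu = blade * CORES_PER_BLADE
    physical = _span(first_cpu, CORES_PER_BLADE)
    siblings = _span(first_cpu + PHYSICAL_CORES_PER_SSI, CORES_PER_BLADE)
    return f"{mems}:{physical},{siblings}"


print(nodeset_for_blade(1))
```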

24 PMM: A text-based monitor
  1 (bl0) 2 (bl1)=partition
  =RACK
  IRU
  ******** ******** ******** *******. ******** ******** ******** *******.
  ******** ******** ******** ******** ******** ******** ******** ********
  ******** ******** ******** ******** ******** ******** ******** ********
  B*xxx*** ******** ******** ***.**** B*xxx*** ******** ******** ********
  4567CDEF=HEX BLADE # Key: *=allocated B=boot
  AB.=free x=not scheduled

25 Blacklight Racks

26 Blacklight IRUs

27 Blacklight 3D Monitor See Blacklight3DMonitor.avi

28 UV Distinctive #3: Lots of Hardware

29 Database Holds Static Configuration Data
- SQLite SQL database engine
- Provides one place to get SSI configuration information for both SSIs
- Easy access to topology command output (each SSI)
- Integration with Simon planned
- Used by pmm and Blacklight 3D Monitor

30 Database Tables (all in under 500 kilobytes!)
- Partitions
- Blades
- Cpus
- Cpusets
- Memnodes
- Devices
- Routers

31 Partitions Table
  sqlite> select * from partitions limit 1;
  partition_num = 1
  serial = UV
  hostname = bl0.psc.teragrid.org
  blades = 128
  routers = 96
  cpus = 4096
  mem_total_gb =
  io_risers = 5
  infiniband_controllers = 6
  network_controllers = 2
  scsi_controllers = 1
  usb_controllers = 8
  vga_gpus = 1

32 Blades Table
  sqlite> select * from blades limit 1;
  blade_num = 0
  partition_num = 1
  blade_name = r001i01b00
  rack = 1
  iru = 1
  blade = 0
  asic = UVHub 2.0
  nasid = 0
  cpus = 32
  memory_kb =
  configured = 0
  comment = boot

33 Cpusets and Memnodes Tables
  sqlite> select * from cpusets limit 1;
  cpuset_num = 0
  partition_num = 1
  cpuset_name = boot
  mems = 0-1
  cpus = 0-15,

  sqlite> select * from memnodes limit 1;
  memnode_num = 0
  blade_num = 0
  partition_num = 1
  cpuset_num = 0
  mem_total_kb =

34 Memnodes and Cpus Tables
  sqlite> select * from memnodes limit 1;
  memnode_num = 0
  blade_num = 0
  partition_num = 1
  cpuset_num = 0
  mem_total_kb =

  sqlite> select * from cpus limit 1;
  cpu_num = 0
  memnode_num = 0
  blade_num = 0
  partition_num = 1
  cpuset_num = 0
  physid = 0
  coreid = 0
  apic_id = 0
  family = 6
  model = 46
  speed = 2266
  l1 = 32d/32i
  l2 = 256
  l3 = 24576

35 Devices and Routers Tables
  sqlite> select * from devices limit 1;
  blade_num = 0
  partition_num = 1
  pci_address = 0000:01:00.0
  x_server_display = -
  device = Intel Gigabit Network Connection

  sqlite> select * from routers limit 1;
  router_num = 0
  partition_num = 1
  router_name = r001i01r00
  rack = 1
  upos = 1
  router = 0
  class = NL5Router

36 Database Queries
- Facilitates blade name and cpu/memnode translation
- Look up last job use by blade
- Helps answer:
  - What blades did a job use?
  - What memnodes and partition correspond to a given blade name?
  - Which blades have less memory than expected after boot?
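
Two of the questions above can be answered with short queries. The sketch below uses Python's standard sqlite3 module with an in-memory database; the column names are taken from the table slides, but the schema is trimmed, the rows are made-up sample data (including the second blade name and both memory sizes), and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the configuration database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE blades (blade_num INTEGER, partition_num INTEGER,
                             blade_name TEXT, memory_kb INTEGER);
        CREATE TABLE memnodes (memnode_num INTEGER, blade_num INTEGER,
                               partition_num INTEGER);
    """)
    # Illustrative rows only; real values come from the topology command.
    conn.executemany("INSERT INTO blades VALUES (?, ?, ?, ?)",
                     [(0, 1, "r001i01b00", 134217728),
                      (1, 1, "r001i01b01", 67108864)])
    conn.executemany("INSERT INTO memnodes VALUES (?, ?, ?)",
                     [(0, 0, 1), (1, 0, 1), (2, 1, 1), (3, 1, 1)])
    return conn


def memnodes_for_blade(conn, blade_name):
    """What memnodes and partition correspond to a given blade name?"""
    rows = conn.execute(
        """SELECT m.partition_num, m.memnode_num
           FROM blades AS b
           JOIN memnodes AS m
             ON m.blade_num = b.blade_num
            AND m.partition_num = b.partition_num
           WHERE b.blade_name = ?
           ORDER BY m.memnode_num""",
        (blade_name,))
    return rows.fetchall()


def blades_below(conn, expected_kb):
    """Which blades have less memory than expected?"""
    rows = conn.execute(
        "SELECT blade_name FROM blades WHERE memory_kb < ? ORDER BY blade_name",
        (expected_kb,))
    return [row[0] for row in rows]


conn = build_demo_db()
print(memnodes_for_blade(conn, "r001i01b01"))
print(blades_below(conn, 134217728))
```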

37 Operations: Pre-job scan (prologue script)
- Cpuset coherency at startup
- Tmpfs
  - RAM-based file system based on cpuset's memory
  - /dev/tmpfs/<job_id> directory
  - Created at job start
  - Destroyed at job end (also scan for orphans in prologue)
- Lustre check
- Save job script for future reference
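
The orphan scan mentioned for the tmpfs directories might look like the sketch below. It is not the actual prologue script: the function names are hypothetical, the list of active job ids is assumed to come from the resource manager, and the demo runs against a temporary directory rather than /dev/tmpfs.

```python
import os
import shutil
import tempfile


def find_orphans(tmpfs_root, active_job_ids):
    """Return per-job directories whose job id is no longer active."""
    active = set(active_job_ids)
    return sorted(
        name for name in os.listdir(tmpfs_root)
        if os.path.isdir(os.path.join(tmpfs_root, name)) and name not in active
    )


def remove_orphans(tmpfs_root, active_job_ids):
    """Delete orphaned job directories and report which ones were removed."""
    orphans = find_orphans(tmpfs_root, active_job_ids)
    for name in orphans:
        shutil.rmtree(os.path.join(tmpfs_root, name))
    return orphans


# Demo with made-up job ids in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    for job_id in ("1001", "1002", "1003"):
        os.mkdir(os.path.join(root, job_id))
    removed = remove_orphans(root, active_job_ids=["1002"])
    remaining = sorted(os.listdir(root))

print(removed, remaining)
```

Since the tmpfs is backed by the cpuset's memory, a leftover directory from a crashed job would otherwise keep holding memory on blades about to be handed to the next job.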

38 Operations: Memory failures
- Check at boot time via topology command difference checker
- Watch memlog via Simple Event Correlator (SEC)
- SEC updates system db so we can keep track of failures
- Provides placeholder so we don't forget about them
- Remove from db after hardware replaced
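
A boot-time difference check can be sketched with Python's standard difflib, comparing a saved baseline against fresh output. The input line format here is invented for the demo (blade name followed by memory in kB); the real topology command output differs, and the function name is hypothetical.

```python
import difflib


def topology_changes(baseline, current):
    """Return removed ('-') and added ('+') lines between two text snapshots."""
    diff = list(difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        "baseline", "current", lineterm="", n=0))
    # The first two entries are the '---'/'+++' file headers; '@@' hunk
    # markers are dropped by the prefix test.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Made-up snapshots: the second blade comes back from boot with half its memory.
baseline = "r001i01b00 134217728\nr001i01b01 134217728\n"
current = "r001i01b00 134217728\nr001i01b01 67108864\n"
changes = topology_changes(baseline, current)
print(changes)
```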

39 Future Plans
- Develop database integration
- Predictive walltime scheduling
  - Mitigate long drain times
  - D. Tsafrir, Y. Etsion, and D. G. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE Trans. Parallel & Distributed Syst. 18(6), Jun.
- Topology-aware scheduling algorithms
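
One simple form of predictive walltime is to estimate a job's runtime from the same user's recent jobs instead of trusting the requested walltime. The sketch below averages the last two actual runtimes and caps the result at the user's request. It is a simplified reading of the idea in the cited paper, not Simon's planned implementation; the function name, the window parameter, and the fallback to the user estimate are assumptions.

```python
def predict_runtime(recent_runtimes, user_estimate, window=2):
    """Predict runtime (seconds) from the user's most recent completed jobs."""
    history = list(recent_runtimes)[-window:]
    if not history:
        # No history yet: fall back to the requested walltime.
        return user_estimate
    average = sum(history) / len(history)
    # The requested walltime remains the hard limit, so never predict beyond it.
    return min(user_estimate, average)


print(predict_runtime([600, 1200], user_estimate=3600))
```

A shorter, more realistic prediction lets the scheduler see that blades will free up sooner than the requested walltimes suggest, which is how it would help with long drain times before a large job starts.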

40 More PSC! Michael Levine giving customer keynote on Blacklight, Wednesday at 9:00am.
