System Software for High Performance Computing Joe Izraelevitz
Agenda Overview of Supercomputers Blue Gene/Q System LoadLeveler Job Scheduler General Parallel File System HPC at UR
What is a Supercomputer? Lots of other computers, closely colocated on a managed network, with a backing store. The World's Simplest Supercomputer (Beowulf Cluster): two Linux machines with rsh enabled, connected by an IPC link.
Key Concepts in Supercomputers Cluster: a grouping of computers Node: a computer within the cluster Job: a program instance (a set of processes)
Operating Systems for HPC Each computer in the cluster has an operating system: off-the-shelf (Red Hat Linux, Windows Server) or specialized (Compute Node Linux, CNK, INK). But the supercomputer as a whole can also have an operating system, called the system management software, which manages its component nodes and sits between each node's OS and the application.
Operating Systems for HPC System Management Software components: node operating system (Linux, CNK, etc.), message passing (MPI, PVM), job scheduler (Maui Scheduler, LoadLeveler), resource manager (Torque Resource Manager, LSF, SLURM), backing store (AFS, DFS, GPFS), and front-end UI, all layered on top of the hardware architecture.
Blue Gene/Q Cluster IBM's flagship supercomputer, third generation. A complete supercomputer: system architecture plus system management software.
Blue Gene/Q Architecture (diagram): compute nodes run the CNK OS (Compute Node Kernel) on 17 cores, connected by an IPC network; I/O nodes run the INK OS (I/O Node Kernel) and reach the GPFS (General Parallel File System) backing store over the file I/O network; a front-end UI and the system management software tie the cluster together.
Blue Gene System Management Software Job Scheduler: LoadLeveler Resource Manager: LoadLeveler Central Manager IPC: MPICH2 File System: GPFS OS: CNK, INK
Job Scheduling Maximize resource usage CPU cycles, RAM, storage space, software licenses Algorithms SJF, LJF, FIFO, High Priority, etc. Considerations Job type, OS Awareness, Scalability, Efficiency, Dynamic Capability, Preemption, OS Scheduling
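Two of the ordering policies above can be sketched in a few lines; this is an illustration of the policies themselves, not any real scheduler's implementation (the job tuples are hypothetical):

```python
# Each job is a hypothetical (name, runtime) pair; the queue lists
# jobs in submission order.

def fifo_order(jobs):
    """FIFO: run jobs in submission order."""
    return [name for name, _ in jobs]

def sjf_order(jobs):
    """SJF (Shortest Job First): run shortest jobs first."""
    return [name for name, runtime in sorted(jobs, key=lambda j: j[1])]

queue = [("A", 40), ("B", 5), ("C", 20)]   # submitted in this order
print(fifo_order(queue))   # ['A', 'B', 'C']
print(sjf_order(queue))    # ['B', 'C', 'A']
```

SJF minimizes average wait time but can starve long jobs, which is one reason real schedulers layer in priorities and reservations.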
Job Scheduler: LoadLeveler Built-in Blue Gene/Q job scheduler Checkpoint support Priority queues Priority from user group FIFO among jobs of equal priority Generally nonpreemptible
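A job is submitted to LoadLeveler as a job command file. A minimal sketch for a Blue Gene job might look like the following (the class name, bg_size, and executable are illustrative values, not defaults; sites define their own classes and limits):

```
# @ job_name         = hello_bgq
# @ job_type         = bluegene
# @ bg_size          = 64
# @ wall_clock_limit = 00:30:00
# @ class            = normal
# @ output           = $(job_name).$(jobid).out
# @ error            = $(job_name).$(jobid).err
# @ queue
runjob --exe hello
```

The wall_clock_limit keyword is what makes BACKFILL scheduling (next slides) possible.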
LoadLeveler: LL_DEFAULT Double queue w/ advanced reservation As nodes are freed, reserve them for the next job NEGOTIATOR_PARALLEL_HOLD: specifies the amount of time a job can hold onto a resource Serial programs queued separately from parallel Issues Underutilization Jobs may never get enough resources within the time allotted
LoadLeveler: BACKFILL Double Queue w/ Advanced Reservation w/ Wall Clock Limit Scheduler can determine when resources will be available Can backfill shorter jobs before large jobs Issues Priority Inversion Incorrect wall clock limit
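The backfill rule can be sketched as a single check; this is a toy model of the idea, not LoadLeveler's actual algorithm (all parameter names and the time units are assumptions):

```python
# A short job may run ahead of a queued large job only if its
# wall-clock limit guarantees it finishes before the large job's
# reserved start time, and it fits in the currently free nodes.

def can_backfill(short_job_limit, short_job_nodes,
                 free_nodes, reservation_start, now=0):
    fits = short_job_nodes <= free_nodes
    ends_in_time = now + short_job_limit <= reservation_start
    return fits and ends_in_time

# Big job reserved to start at t=60 on the nodes being drained:
print(can_backfill(short_job_limit=30, short_job_nodes=8,
                   free_nodes=16, reservation_start=60))   # True
print(can_backfill(short_job_limit=90, short_job_nodes=8,
                   free_nodes=16, reservation_start=60))   # False
```

The "incorrect wall clock limit" issue on the slide is visible here: if users overstate their limits, jobs that would in fact finish in time are refused backfill slots.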
LoadLeveler: GANG Coordinated time-multiplexing scheduler: each time slice runs as a virtual machine Issues Increased run time Context-switch overhead RAM limited
General Parallel File System (GPFS) Blue Gene/Q default file system Parallel access to files and file metadata Design considerations: highly parallel access, bandwidth bottleneck, huge disks and files (Diagram: compute nodes → I/O nodes → I/O network → disk array)
GPFS Overview Striped files: files stored in ~256 KB blocks, distributed across disks in round-robin fashion Massively parallel file retrieval, bandwidth limited Vulnerable to failure, so RAID redundancy on each disk
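Round-robin striping makes the block-to-disk mapping a simple modulus; the following is a simplified model of the idea, not GPFS's actual allocation code:

```python
BLOCK_SIZE = 256 * 1024   # ~256 KB blocks, as on the slide

def disk_for_offset(byte_offset, num_disks):
    """Which disk holds the block containing this byte offset,
    under round-robin striping across num_disks disks."""
    block_index = byte_offset // BLOCK_SIZE
    return block_index % num_disks

# With 4 disks, consecutive blocks land on disks 0, 1, 2, 3, 0, ...
print([disk_for_offset(i * BLOCK_SIZE, 4) for i in range(6)])
# [0, 1, 2, 3, 0, 1]
```

Because consecutive blocks live on different disks, a large sequential read can be serviced by all disks in parallel, up to the network bandwidth limit.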
GPFS: Read/Write File parallelism via two methods: (1) distributed lock manager, with locks for byte ranges within a file and lock tokens issued to I/O nodes; (2) data shipping, with RCU-managed blocks within a single file Metadata parallelism: one I/O node is designated as the metanode for each file and maintains its inode information
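The byte-range locking idea can be illustrated with a minimal lock table; this sketch only shows the overlap test, not GPFS's distributed token protocol:

```python
# Two exclusive locks conflict iff their [start, end) byte ranges
# overlap; non-overlapping writers to the same file proceed in parallel.

class ByteRangeLocks:
    def __init__(self):
        self.held = []   # list of (start, end) exclusive ranges

    def try_lock(self, start, end):
        for s, e in self.held:
            if start < e and s < end:   # ranges overlap
                return False
        self.held.append((start, end))
        return True

locks = ByteRangeLocks()
print(locks.try_lock(0, 100))     # True: first lock on bytes 0-99
print(locks.try_lock(50, 150))    # False: overlaps the held range
print(locks.try_lock(100, 200))   # True: adjacent ranges don't overlap
```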
GPFS: Allocate/Delete Allocation manager: maintains a bitmap of free blocks and issues region locks File allocation: get a region lock, check it for free space File deletion: requires updating the allocation manager and clearing disk space while holding the region lock; deletion is delayed and distributed across I/O nodes based on lock ownership
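The allocation-manager idea, a free-block bitmap carved into lockable regions, can be sketched as follows (the region size and method names are assumptions for illustration):

```python
REGION_SIZE = 8   # blocks per lockable region (illustrative)

class AllocationMap:
    """Bitmap of free blocks, allocated region by region. A node
    holding the lock on one region allocates there without talking
    to nodes working in other regions."""

    def __init__(self, num_blocks):
        self.free = [True] * num_blocks

    def allocate_in_region(self, region):
        """Holding the region lock, claim the first free block in it."""
        start = region * REGION_SIZE
        for i in range(start, start + REGION_SIZE):
            if self.free[i]:
                self.free[i] = False
                return i
        return None   # region exhausted; caller locks another region

amap = AllocationMap(16)
print(amap.allocate_in_region(0))   # 0
print(amap.allocate_in_region(0))   # 1
print(amap.allocate_in_region(1))   # 8
```

Region locks are what make allocation parallel: two I/O nodes holding different region locks never contend on the same bitmap bits.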
GPFS: Disk Organization Extensible hashing within directories: use n bits of the hash function to group files; on overflow, increase to n+1 bits and reorganize Journaling file system on disk: the journal is shared, so any node can recover the disk
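The extensible-hashing scheme above has a useful property when the directory grows from n to n+1 bits: a file's new bucket is predictable from its old one. A simplified illustration (Python's built-in hash stands in for the real hash function):

```python
def bucket_for(name, n_bits):
    """Pick a directory bucket from the low n_bits of the name's hash."""
    return hash(name) & ((1 << n_bits) - 1)

# On overflow, going from n to n+1 bits splits old bucket b between
# buckets b and b + 2**n, so only one bucket's entries need to move.
n = 2
name = "data.txt"
b_old = bucket_for(name, n)
b_new = bucket_for(name, n + 1)
print(b_new in (b_old, b_old + 2**n))   # True
```

This locality is why reorganization after a split is cheap: every other bucket in the directory is untouched.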
Message Passing (MPI) MPI (Message Passing Interface) is a standard (not a library) Implementations with compliant compilers: OpenMPI, MPICH, mpijava, pympi, etc. "Superfork" to all available CPUs: a master process spawns worker processes, each calling MPI_INIT(), then communicating via MPI_SEND(), MPI_RECV(), MPI_WAIT() (Mostly) OS- and cluster-manager-independent
Resource Manager Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs; without these capabilities, a scheduler alone cannot control jobs. A daemon runs on each node on top of the OS Layer of abstraction between the OS and the message passing interface Interfaces with the job scheduler Manages job submission and the admin interface Monitors compute resources
HPC at U of R Blue Streak: Blue Gene/Q system, SLURM BG/P: Blue Gene/P system, LoadLeveler resource manager/scheduler BlueHive: Intel Blade Center system, Torque Resource Manager with Maui Scheduler
Works Cited
Barney, Blaise. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/mpi/#abstract (2012).
Center for Integrated Research Computing. Resources. University of Rochester. http://www.circ.rochester.edu/resources.html (2012).
Gilge, Megan. IBM System Blue Gene Solution: Blue Gene/Q Application Development. International Technical Support Organization, IBM. March 2012. http://www.redbooks.ibm.com/redpieces/pdfs/sg247948.pdf
Iqbal, Saeed, Rinku Gupta, and Yung-Chin Fang. Planning Considerations for Job Scheduling in HPC Clusters. Dell Power Solutions, February 2005. http://www.dell.com/downloads/global/power/ps1q05-20040135-fang.pdf
Lakner, Gary and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/Q System Administration. International Technical Support Organization, IBM. June 2012. http://www.redbooks.ibm.com/redbooks/pdfs/sg247869.pdf
Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, and Joseph F. Skovira. Workload Management with LoadLeveler. International Technical Support Organization, IBM. November 2001. http://www.redbooks.ibm.com/redbooks/pdfs/sg246038.pdf
Schmuck, Frank and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. Proceedings of the Conference on File and Storage Technologies (FAST 02), 28-30 January 2002, Monterey, CA, pp. 231-244. USENIX, Berkeley, CA.