System Software for High Performance Computing. Joe Izraelevitz

1 System Software for High Performance Computing Joe Izraelevitz

2 Agenda
- Overview of Supercomputers
- Blue Gene/Q System
- LoadLeveler Job Scheduler
- General Parallel File System
- HPC at UR

3 What is a Supercomputer?
- Lots of other computers
- Closely colocated on a managed network
- Backing store
[Figure: The World's Simplest Supercomputer (Beowulf Cluster). Labels: IPC; Linux w/ rsh enabled (two nodes)]

4 Key Concepts in Supercomputers
- Cluster: a grouping of computers
- Node: a computer within the cluster
- Job: a program instance (a set of processes)

5 Operating Systems for HPC
- Each computer in the cluster has an operating system
  - Off the shelf: Linux Red Hat, Windows Server
  - Specialized: Compute Node Linux, CNK, INK
- But the supercomputer can also have an operating system called the system management software, which manages its component nodes
[Figure: layer diagram. Labels: OS; System Management Software; Application]

6 Operating Systems for HPC
System Management Software Components
- Node Operating System (Linux, CNK, etc.)
- Message Passing (MPI, PVM)
- Job Scheduler (Maui Scheduler, LoadLeveler)
- Resource Manager (Torque Resource Manager, LSF, SLURM)
- Backing Store (AFS, DFS, GPFS)
- Front End UI
- Hardware Architecture

7 Blue Gene/Q Cluster
- IBM Flagship supercomputer
- Third generation
- Complete Supercomputer System
  - Architecture
  - System Management Software

8 Blue Gene/Q Architecture
[Figure: architecture diagram. Labels: File I/O Network; Front End UI; CNK OS (Compute Node Kernel), on 17 cores; IPC Network; Backing Store: GPFS (General Parallel File System); System Management Software; INK OS (IO Node Kernel)]

9 Blue Gene System Management Software
- Job Scheduler: LoadLeveler
- Resource Manager: LoadLeveler Central Manager
- IPC: MPICH2
- File System: GPFS
- OS: CNK, INK

10 Job Scheduling
- Maximize resource usage: CPU cycles, RAM, storage space, software licenses
- Algorithms: SJF, LJF, FIFO, High Priority, etc.
- Considerations: Job type, OS Awareness, Scalability, Efficiency, Dynamic Capability, Preemption, OS Scheduling
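
The queue-ordering algorithms named on this slide can be illustrated with a small sketch. This is not taken from the deck: the `Job` fields and the `order_jobs` helper are hypothetical names, and run times are assumed to be known estimates in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    est_runtime: int   # estimated run time (assumed known, arbitrary units)
    submit_order: int  # position in arrival order


def order_jobs(jobs, policy):
    """Return the jobs in the order the given policy would dispatch them."""
    keys = {
        "FIFO": lambda j: j.submit_order,
        # Shortest Job First; ties fall back to arrival order.
        "SJF": lambda j: (j.est_runtime, j.submit_order),
        # Longest Job First; ties fall back to arrival order.
        "LJF": lambda j: (-j.est_runtime, j.submit_order),
    }
    return sorted(jobs, key=keys[policy])


jobs = [Job("a", 30, 0), Job("b", 5, 1), Job("c", 60, 2)]
for policy in ("FIFO", "SJF", "LJF"):
    print(policy, [j.name for j in order_jobs(jobs, policy)])
```

SJF tends to reduce average wait time, while LJF tends to place large jobs first so small ones can fill the remaining gaps; which one maximizes resource usage depends on the workload.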

11 Job Scheduler: LoadLeveler
- Built-in Blue Gene/Q job scheduler
- Checkpoint support
- Priority Queues
  - Priority from user group
  - FIFO within jobs of equal priority
- Generally non-preemptible
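
The "priority first, FIFO among equals" rule can be sketched with a heap. This is an illustration only, not LoadLeveler's implementation; `PriorityJobQueue` and its methods are hypothetical names, and a larger number is assumed to mean higher priority.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Dispatch by priority; jobs of equal priority leave in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing tie-breaker

    def submit(self, job_name, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), job_name))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PriorityJobQueue()
queue.submit("low-1", 1)
queue.submit("high-1", 5)
queue.submit("low-2", 1)
queue.submit("high-2", 5)
dispatch_order = [queue.next_job() for _ in range(len(queue))]
print(dispatch_order)
```

The arrival counter is what guarantees FIFO order within a priority level; without it, ties would be broken by comparing job names.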

12 LoadLeveler: LL_DEFAULT
- Double Queue w/ Advanced Reservation
  - As nodes are freed, reserve them for the next job
  - NEGOTIATOR_PARALLEL_HOLD: Specify the amount of time a job can hold onto a resource
  - Serial programs queued separately from parallel programs
- Issues
  - Underutilization
  - Jobs may never get enough resources within the time allotted

13 LoadLeveler: BACKFILL
- Double Queue w/ Advanced Reservation w/ Wall Clock Limit
  - Scheduler can determine when resources will be available
  - Can backfill shorter jobs before large jobs
- Issues
  - Priority Inversion
  - Incorrect wall clock limit
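
A minimal sketch of the backfill idea, under simplifying assumptions that are not from the deck: every running job ends exactly at its wall clock limit, only the job at the head of the queue gets a reservation, and a later job may start now only if it fits in the currently free nodes and its wall clock limit ends before that reservation. The function name and tuple layouts are hypothetical.

```python
def plan_backfill(free_nodes, running, queue, now=0):
    """Pick queued jobs that can start now without delaying the head job.

    running: list of (end_time, nodes) for jobs currently executing.
    queue:   list of (name, nodes, wall_limit) in priority order.
    Returns (reservation_time_for_head_job, names_of_backfilled_jobs).
    """
    _, head_nodes, _ = queue[0]

    # Walk running jobs in order of completion until the head job would fit.
    available = free_nodes
    reservation = now
    for end_time, nodes in sorted(running):
        if available >= head_nodes:
            break
        available += nodes
        reservation = end_time
    if available < head_nodes:
        raise ValueError("head job needs more nodes than the cluster has")

    # Backfill: later jobs may run now if they fit and finish before the reservation.
    backfilled = []
    for name, nodes, wall_limit in queue[1:]:
        if nodes <= free_nodes and now + wall_limit <= reservation:
            backfilled.append(name)
            free_nodes -= nodes
    return reservation, backfilled


reservation, started = plan_backfill(
    free_nodes=2,
    running=[(100, 4), (50, 2)],
    queue=[("big", 6, 200), ("small", 1, 40), ("long", 1, 500), ("medium", 1, 90)],
)
print(reservation, started)
```

The sketch also shows why an incorrect wall clock limit is listed as an issue: if "small" actually ran past its declared limit, it would still hold a node when "big" is due to start.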

14 LoadLeveler: GANG
- Coordinated time-multiplexed scheduler
- Each time slice is treated as a virtual machine
- Issues
  - Increased run time
  - Context switch overhead
  - RAM limited

15 General Parallel File System (GPFS)
- Blue Gene/Q default file system
- Parallel access to files, file metadata
- Design considerations:
  - Highly parallel access
  - Bandwidth bottleneck
  - Huge disks and files
[Figure: I/O path diagram. Labels: Compute Nodes; I/O Nodes; I/O Network; Disk Array]

16 GPFS Overview
- Striped Files
  - Files stored in (~256K) blocks per disk
  - Distributed in round robin fashion
- Massively parallel file retrieval, bandwidth limited
- Vulnerable to failure
  - RAID redundancy on each disk
[Figure: striping diagram. Labels: Block; File]
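
Round-robin striping amounts to simple index arithmetic. The sketch below is a simplification rather than GPFS's actual placement logic: it assumes a fixed 256 KiB block size, a file whose first block lands on disk 0, and a hypothetical helper name `locate`.

```python
BLOCK_SIZE = 256 * 1024  # assumed block size, matching the ~256K on the slide


def locate(offset, num_disks, block_size=BLOCK_SIZE):
    """Map a byte offset in a file to (disk, block slot on that disk, offset in block)."""
    block_index = offset // block_size   # which file block holds the byte
    disk = block_index % num_disks       # round-robin choice of disk
    slot = block_index // num_disks      # how many of the file's blocks precede it on that disk
    return disk, slot, offset % block_size


# Consecutive blocks of one file land on consecutive disks.
for block in range(6):
    print(block, locate(block * BLOCK_SIZE, num_disks=4))
```

Because neighbouring blocks sit on different disks, a large sequential read can be served by all disks at once, which is why the network bandwidth rather than a single disk becomes the limit.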

17 GPFS: Read/Write
- File Parallelism in two methods
  - Distributed lock manager
    - Lock for byte ranges within file
    - Lock tokens issued to I/O nodes
  - Data Shipping
    - RCU managed blocks within single file
- Metadata Parallelism
  - One I/O node designated as metanode for file and maintains the inode information
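
Byte-range locking can be sketched as a table of tokens held per node. This is a toy model under stated assumptions, not the GPFS token protocol: all tokens are exclusive, ranges are half-open `[start, end)`, a conflicting request is simply refused rather than triggering token revocation, and the class and method names are hypothetical.

```python
class ByteRangeLockManager:
    """Grant exclusive tokens on byte ranges of one file to I/O nodes."""

    def __init__(self):
        self._tokens = []  # list of (node, start, end), half-open ranges

    def acquire(self, node, start, end):
        for holder, held_start, held_end in self._tokens:
            overlaps = start < held_end and held_start < end
            if overlaps and holder != node:
                return False  # another node holds part of this range
        self._tokens.append((node, start, end))
        return True

    def release(self, node, start, end):
        self._tokens.remove((node, start, end))


manager = ByteRangeLockManager()
print(manager.acquire("io0", 0, 100))    # disjoint ranges can be held at once
print(manager.acquire("io1", 100, 200))
print(manager.acquire("io1", 50, 150))   # overlaps io0's range, so refused
```

Nodes writing to disjoint ranges never contend, which is what lets many writers work on one file in parallel.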

18 GPFS: Allocate/Delete
- Allocation Manager
  - Maintain bitmap of free blocks
  - Issue region locks
- File Allocation
  - Get region lock, check for free space
- File Deletion
  - Requires update of allocation manager
  - Requires clearing disk space while holding region lock
  - Delayed distributed deletion across I/O nodes based on lock ownership
[Figure: allocation map diagram. Labels: Block; Region; File]
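
The allocation scheme above can be sketched as a free-block bitmap split into regions, each lockable by one node. This is an illustrative model only: the class name, the equal-sized regions, and the first-fit search are assumptions, and the block count is assumed to divide evenly by the region count.

```python
class AllocationMap:
    """Free-block bitmap partitioned into regions that nodes lock individually."""

    def __init__(self, num_blocks, num_regions):
        self.free = [True] * num_blocks
        self.region_size = num_blocks // num_regions
        self.region_owner = [None] * num_regions

    def lock_region(self, region, node):
        if self.region_owner[region] not in (None, node):
            return False  # region is locked by a different node
        self.region_owner[region] = node
        return True

    def allocate(self, region, node):
        """Return the first free block in a region the node has locked, or None."""
        if self.region_owner[region] != node:
            raise PermissionError("node does not hold the region lock")
        start = region * self.region_size
        for block in range(start, start + self.region_size):
            if self.free[block]:
                self.free[block] = False
                return block
        return None

    def free_block(self, block, node):
        # Deletion also requires the lock of the region containing the block.
        if self.region_owner[block // self.region_size] != node:
            raise PermissionError("node does not hold the region lock")
        self.free[block] = True


amap = AllocationMap(num_blocks=8, num_regions=2)
amap.lock_region(0, "io0")
amap.lock_region(1, "io1")
print(amap.allocate(0, "io0"), amap.allocate(1, "io1"))
```

Since each node allocates only inside regions it has locked, two nodes can allocate blocks at the same time without touching the same part of the bitmap.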

19 GPFS: Disk Organization
- Extensible Hashing within directories
  - Use n bits of hashing function to group files
  - On collision, increase to n+1 and reorganize
- Journal file system on disk
  - Shared journal, so any node can restore disk
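
The "use n bits, grow to n+1 on collision" scheme is classic extensible (extendible) hashing. The sketch below is a generic textbook version, not GPFS's on-disk directory format: the bucket capacity of 2, the use of `zlib.crc32` as the hash function, and all class names are assumptions made for illustration.

```python
import zlib


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of hash bits this bucket distinguishes
        self.names = []


class ExtensibleDirectory:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0       # the "n" bits currently used
        self.slots = [Bucket(0)]    # 2**global_depth slots, possibly sharing buckets

    def _slot(self, name):
        # Index by the low global_depth bits of the hash.
        return zlib.crc32(name.encode("utf-8")) & ((1 << self.global_depth) - 1)

    def lookup(self, name):
        return name in self.slots[self._slot(name)].names

    def insert(self, name):
        bucket = self.slots[self._slot(name)]
        if name in bucket.names:
            return
        if len(bucket.names) < self.capacity:
            bucket.names.append(name)
            return
        self._split(bucket)  # bucket is full: use one more bit and retry
        self.insert(name)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.slots = self.slots + self.slots  # double the slot table (n -> n+1)
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)
        # Slots whose newly significant bit is set now point at the sibling.
        for i, b in enumerate(self.slots):
            if b is bucket and i & new_bit:
                self.slots[i] = sibling
        # Reorganize: redistribute the old entries between the two buckets.
        old_names, bucket.names = bucket.names, []
        for n in old_names:
            self.slots[self._slot(n)].names.append(n)


directory = ExtensibleDirectory()
for i in range(20):
    directory.insert(f"file{i}.dat")
print(directory.global_depth, len(directory.slots))
```

Only the overflowing bucket is reorganized on a split, so a directory can grow without rehashing every entry. The sketch does not guard against more names with fully identical hashes than one bucket can hold.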

20 Message Passing (MPI)
- MPI (Message Passing Interface)
  - Standard (not a library)
  - Implementations with compliant compilers: OpenMPI, MPICH, mpijava, pympi, etc.
- Superfork to all CPUs available
- MPI_INIT(), MPI_SEND(), MPI_RECV(), MPI_WAIT()
- (mostly) OS, Cluster Manager independent
[Figure: process diagram. Labels: Master Process; Process (three); MPI_INIT()]

21 Resource Manager
- "Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs."
- Daemon runs on each node on top of OS
- Layer of abstraction between OS and Message Passing Interface
- Interfaces with Job Scheduler
- Manages job submission, admin interface
- Monitors compute resources

22 HPC at U of R
[Figure: systems diagram. Labels: Blue Streak; Blue Gene/Q System; BG/P; SLURM; Blue Gene/P System; LoadLeveler Resource Manager/Scheduler; BlueHive; Torque Resource Manager, Maui Scheduler; Intel Blade Center System]

23 Works Cited
- Barney, Blaise. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. (2012).
- Center for Integrated Research Computing. Resources. University of Rochester. (2012).
- Gilge, Megan. IBM System Blue Gene Solution: Blue Gene/Q Application Development. International Technical Support Organization. IBM. March
- Iqbal, Saeed, Rinku Gupta, Yung-Chin Fang. Planning Considerations for Job Scheduling in HPC Clusters. Dell Power Solutions, February
- Lakner, Gary and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/Q System Administration. International Technical Support Organization. IBM. June
- Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, Joseph F Skovira. Workload Management with LoadLeveler. International Technical Support Organization. IBM. November
- Schmuck, Frank and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. Proceedings of the Conference on File and Storage Technologies (FAST 02), January 2002, Monterey, CA, pp (USENIX, Berkeley, CA.)
