Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 / 29
Outline

1 Cluster components
    Nodes
    Interconnect
    Storage and file systems
    Software
2 Node provisioning, resource management, and job scheduling
    Provisioning nodes
    Resource management and job scheduling
Acknowledgements

Some material used in creating these slides comes from:
http://en.wikipedia.org/wiki/List_of_device_bandwidths
http://www.hpcwire.com/hpcwire/2010-11-18/is_10_gigabit_ethernet_ready_for_hpc.html
Cluster components

A typical cluster consists of the following components:
- master/login nodes (1 or more)
- compute nodes (many)
- interconnect (1 or more)
- storage system
- system software
- development tools
- runtime system
Master/login nodes

- Master (or service) nodes run the resource manager and job scheduler.
- Login nodes handle interactive user logins, software development, submission of jobs, and pre- and post-processing of data.

On small clusters a single node is both the master and login node. Larger clusters have multiple master nodes for high availability (HA) and multiple separate login nodes.
Compute nodes

Compute node configuration depends on the applications the cluster is designed to support. Important factors to consider are:
- number of processors, number of cores per processor
- amount of RAM, FSB speed
- GPU or other accelerator, local storage, ...
Interconnect

- The network that connects the compute nodes to each other and to the master/login nodes is called an interconnect fabric, or just interconnect.
- As with compute nodes, the type of interconnect chosen depends on the applications the cluster is designed to run.
- Key parameters are latency and bandwidth.
- A scalable, low-latency, high-bandwidth interconnect is desirable for the tightly coupled tasks typical in HPC.
- The cost of the interconnect can be a significant portion of the overall cluster cost.
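A common first-order cost model for sending one message is t = latency + size / bandwidth. The short Python sketch below (illustrative only; the latency and bandwidth figures are the Ethernet numbers quoted later in these slides) shows why latency, not bandwidth, dominates for the small messages typical of tightly coupled HPC codes:

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple linear cost model: t = latency + size / bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# An 8-byte message (e.g., one double) on each network:
# GigE: ~20 usec latency, 125 MB/s; low-latency 10-GigE: ~5 usec, 1.25 GB/s
t_gige = transfer_time(8, 20e-6, 125e6)
t_10gige = transfer_time(8, 5e-6, 1.25e9)

# For such short messages almost all the time is latency, so the
# 10x bandwidth advantage of 10-GigE barely matters; its 4x lower
# latency accounts for nearly all of the ~4x speedup.
print(t_gige, t_10gige)
```

Only for large messages does the size/bandwidth term take over, which is why latency is the headline number for tightly coupled workloads.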
Interconnect options

Two main options: Ethernet or InfiniBand.

Image source: http://www.hpcadvisorycouncil.com/pdf/hpcmarket_and_interconnects_report.pdf, 2009.
Ethernet

- Gigabit Ethernet (GigE), available since the early 2000s, is now the Ethernet standard for general use.
- 10-Gigabit Ethernet (10-GigE) became available in the late 2000s.
- The names refer to the supplied bandwidth: 1 Gigabit/s is 125 MB/s, while 10 Gigabit/s is 1.25 GB/s.
- Typical GigE latency is 20 µsec; low-latency 10-GigE latency can be around 4 to 5 µsec.
- In many HPC applications low latency is more important than bandwidth: many short messages are sent between tightly coupled processes.
- Unlike Fast Ethernet and GigE, 10-GigE operates only in full-duplex mode over a switched network fabric (no hubs).
- Still somewhat expensive: adapters $300–$600, switches $1,000–$10,000.
InfiniBand

- InfiniBand (IB) is a switched network fabric.
- Very low latency: 1 to 3 µsec.
- Bandwidth is comparable to or better than 10-GigE; InfiniBand QDR 12x bandwidth is 12 GB/s.
- Newer InfiniBand EDR technology is pushing 36 GB/s.
- Cost is comparable to 10-GigE, but IB usually must be augmented by an Ethernet network.
Other/hybrid

- Medium to large clusters often have multiple network interconnects.
- IB or 10-GigE is used for the compute-node interconnect fabric: low latency and high bandwidth.
- This interconnect may also connect to the storage subsystem...
- ... or a separate IB or 10-GigE network may be used for access to storage and the master/login node(s).
- In some clusters IB is used to interconnect groups of compute nodes, and 10-GigE or even GigE is used to connect the groups of nodes to each other (a compromise to reduce cost).
Storage and file systems

- In small clusters, disks in the master/login node provide primary shared storage. Compute nodes may have disks for scratch space.
- In larger clusters, a separate storage area network (SAN) is used to provide storage to the cluster.
- Usually a distributed file system (DFS) is used to make the storage network appear transparently as a disk or disks to the cluster nodes.
- Currently Lustre is a popular DFS option; others include NFS, GPFS, and FhGFS.
- Desired goal: provide concurrent, high-speed access to applications executing on multiple nodes.
NFS

- NFS stands for Network File System.
- Developed by Sun Microsystems in the early 1980s.
- Open-source implementations exist for most systems.
- Still in wide use; NFS v4 is the current standard, with performance and security enhancements over previous versions.
GPFS

- This is IBM's General Parallel File System.
- Used on some computers in the Top500 list and in many commercial clusters.
- First appeared in the late 1990s.
- Distributed metadata: there is no single metadata controller, eliminating a potential bottleneck.
- Depends on RAID for redundancy and protection from loss of data.
Lustre

- Open source; the name is derived from Linux and cluster.
- Used by Titan and 5 others of the top 10 computers in the Top500 list.
- The Lustre system has three main components:
  1 An MDS (metadata server) and associated MDTs (metadata targets; one per Lustre file system)
  2 One or more OSSes (object storage servers) that interact with OSTs (object storage targets: disks, SAN, etc.)
  3 Clients: cluster nodes, workstations, archival storage systems, etc.
- Designed for high availability and scalability.
Lustre

Image source: http://users.nccs.gov/~fwang2/papers/lustre_report.pdf
HPC software stack

One vendor's software stack diagram:
HPC software

Operating system
- Most clusters today run some version of Linux; Red Hat and CentOS (both RPM-based) are most popular.
- Some vendors (e.g., Cray) have customized versions of Linux.

Cluster management and control
- provision compute nodes
- schedule jobs

HPC development tools
- Compilers
- Debuggers and profiling tools
- MPI libraries and runtime system
The need for diskless provisioning

- Original Beowulf clusters consisted of individual, stand-alone computers connected by a network.
- Each node has a disk holding the OS and other software.
- Our workstation cluster follows this model.
- This model is untenable for medium or large clusters, however, since:
  - each node would have to be installed individually
  - software upgrades would be a huge headache
- The solution is to configure the nodes when they boot using a centralized system image.
PXE

- The PXE (Preboot eXecution Environment) system uses DHCP and TFTP (Trivial File Transfer Protocol) to assign a network address and distribute an OS image and RAM disk to a node when it boots.
- Nodes are not required to have disks (but may have them, for scratch work).
- Only one OS image and RAM disk need be maintained for each type of node.
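As a concrete illustration, the DHCP and TFTP services that PXE needs can both be provided on the master node by a single dnsmasq instance. The fragment below is a minimal sketch only; the interface name, address range, and paths are hypothetical and would differ on a real cluster:

```
# /etc/dnsmasq.conf (sketch)
interface=eth1                         # cluster-facing network interface
dhcp-range=10.1.0.10,10.1.0.250,12h    # address pool handed to compute nodes
dhcp-boot=pxelinux.0                   # boot loader the nodes fetch first
enable-tftp                            # serve boot files over TFTP
tftp-root=/srv/tftpboot                # holds pxelinux.0, the kernel, and the RAM disk
```

Each node's firmware broadcasts a DHCP request at power-on, receives an address plus the name of the boot loader, fetches it over TFTP, and then pulls down the kernel and RAM-disk image, so the node runs entirely from the centrally maintained image.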
Resource management

- A cluster resource management system provides much of the same functionality for the cluster that the OS provides for an individual system.
- The most important resources in a cluster are the compute nodes.
- Nodes may not all be equivalent: some may have more memory, a scratch disk, one or more accelerators (GPU, Xeon Phi), and/or share a faster interconnect with certain other nodes.
- The resource management system is responsible for controlling the allocation of resources to jobs on the cluster.
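To make the allocation step concrete, here is a minimal Python sketch of matching a job's requirements against a set of non-equivalent nodes. This is not any real resource manager's API; the node attributes and field names are invented for illustration:

```python
def matching_nodes(nodes, req):
    """Return the nodes whose attributes satisfy a job's resource request."""
    return [n for n in nodes
            if n["free_cores"] >= req.get("cores", 1)      # enough idle cores
            and n["mem_gb"] >= req.get("mem_gb", 0)        # enough memory
            and (not req.get("gpu") or n.get("gpu", False))]  # GPU if required

nodes = [
    {"name": "n001", "free_cores": 16, "mem_gb": 64,  "gpu": False},
    {"name": "n002", "free_cores": 8,  "mem_gb": 128, "gpu": True},
]

# A job needing 8 cores and a GPU can only go to n002:
print([n["name"] for n in matching_nodes(nodes, {"cores": 8, "gpu": True})])
# → ['n002']
```

A real resource manager tracks many more attributes (interconnect topology, licenses, node health) and updates them as jobs start and finish, but the core operation is this kind of constraint filter.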
Job scheduler

- The job scheduler uses information supplied by the resource manager to determine the best match between job requirements and available resources.
- It then provides this information to the resource manager, which starts jobs as the necessary resources become available.
- Multiple scheduling algorithms exist, including:
  - FCFS: first come, first served
  - FIFO: first in, first out
  - RR: round robin
  - SJF: shortest job first
  - LJF: longest job first
- The algorithm chosen reflects the desired scheduling policy.
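As one example, shortest job first (SJF) can be sketched with a priority queue keyed on each job's estimated runtime. This is an illustrative Python sketch, not a production scheduler; the job names and runtimes are made up:

```python
import heapq

def sjf_order(jobs):
    """Return job names in shortest-job-first order, given estimated runtimes."""
    heap = [(runtime, name) for name, runtime in jobs.items()]
    heapq.heapify(heap)  # min-heap keyed on estimated runtime
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

jobs = {"sim_a": 120, "quick_test": 5, "render": 45}  # runtimes in minutes
print(sjf_order(jobs))
# → ['quick_test', 'render', 'sim_a']
```

LJF is the same sketch with the sign of the key flipped, while FCFS/FIFO would key on submission time instead of runtime.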
Fair share

- Job schedulers often adjust rigid scheduling decisions based on usage history.
- For example, during daytime hours an SJF policy may be enforced, giving preference to jobs with quick turn-around time.
- Suppose Susan keeps submitting jobs that take 10 minutes to run, but Bob needs to run a 15-minute job. Using strict SJF, Susan's jobs will always run before Bob's.
- If the scheduler keeps track of the number of jobs run by each user, it will eventually decide that Susan has had more than her fair share of the cluster nodes, and Bob's job will be run.