Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 / 29
Outline

1 Cluster components
    Nodes
    Interconnect
    Storage and file systems
    Software
2 Node provisioning, resource management, and job scheduling
    Provisioning nodes
    Resource management and job scheduling
Acknowledgements

Some material used in creating these slides comes from:
http://en.wikipedia.org/wiki/List_of_device_bandwidths
http://www.hpcwire.com/hpcwire/2010-11-18/is_10_gigabit_ethernet_ready_for_hpc.html
Cluster components

A typical cluster consists of the following components:
- master/login nodes (1 or more)
- compute nodes (many)
- interconnect (1 or more)
- storage system
- system software
- development tools
- runtime system
Master/login nodes

- Master (or service) nodes run the resource manager and job scheduler.
- Login nodes handle interactive user logins, software development, submission of jobs, and pre- and post-processing of data.

On small clusters a single node is both the master and login node. Larger clusters have multiple master nodes for high availability (HA) and multiple separate login nodes.
Compute nodes

Compute node configuration depends on the applications the cluster is designed to support. Important factors to consider are:
- number of processors, number of cores per processor
- amount of RAM, FSB speed
- GPU or other accelerator, local storage, ...
Interconnect

- The network that connects the compute nodes to each other and to the master/login nodes is called an interconnect fabric, or just interconnect.
- As with compute nodes, the type of interconnect chosen depends on the applications the cluster is designed to run.
- Key parameters are latency and bandwidth.
- A scalable, low-latency, high-bandwidth interconnect is desirable for the tightly coupled tasks typical in HPC.
- The cost of the interconnect can be a significant portion of the overall cluster cost.
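A common first-order cost model for sending one message is t = latency + size / bandwidth. The short Python sketch below (illustrative only; the latency and bandwidth figures are the Ethernet numbers quoted later in these slides) shows why latency, not bandwidth, dominates for the small messages typical of tightly coupled HPC codes:

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple linear cost model: t = latency + size / bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# An 8-byte message (e.g., one double) on each network:
# GigE: ~20 usec latency, 125 MB/s; low-latency 10-GigE: ~5 usec, 1.25 GB/s
t_gige = transfer_time(8, 20e-6, 125e6)
t_10gige = transfer_time(8, 5e-6, 1.25e9)

# For such short messages almost all the time is latency, so the
# 10x bandwidth advantage of 10-GigE barely matters; its 4x lower
# latency accounts for nearly all of the ~4x speedup.
print(t_gige, t_10gige)
```

Only for large messages does the size/bandwidth term take over, which is why latency is the headline number for tightly coupled workloads.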
Interconnect options

Two main options: Ethernet or InfiniBand.

Image source: http://www.hpcadvisorycouncil.com/pdf/hpcmarket_and_interconnects_report.pdf, 2009.
Ethernet

- Gigabit Ethernet (GigE), available since the early 2000s, is now the Ethernet standard for general use.
- 10-Gigabit Ethernet (10-GigE) became available in the late 2000s.
- The names refer to the supplied bandwidth: 1 Gigabit/s is 125 MB/s, while 10 Gigabit/s is 1.25 GB/s.
- Typical GigE latency is 20 µsec; low-latency 10-GigE latency can be around 4 to 5 µsec.
- In many HPC applications low latency is more important than bandwidth: many short messages are sent between tightly coupled processes.
- Unlike Fast Ethernet and GigE, 10-GigE operates only in full-duplex mode over a switched network fabric (no hubs).
- Still somewhat expensive: adapters $300–$600, switches $1,000–$10,000.
InfiniBand

- InfiniBand (IB) is a switched network fabric.
- Very low latency: 1 to 3 µsec.
- Bandwidth is comparable to or better than 10-GigE; InfiniBand QDR 12x bandwidth is 12 GB/s.
- Newer InfiniBand EDR technology is pushing 36 GB/s.
- Cost is comparable to 10-GigE, but IB usually must be augmented by an Ethernet network.
Other/hybrid

- Medium to large clusters often have multiple network interconnects.
- IB or 10-GigE is used for the compute-node interconnect fabric: low latency and high bandwidth.
- This interconnect may also connect to the storage subsystem...
- ... or a separate IB or 10-GigE network may be used for access to storage and the master/login node(s).
- In some clusters IB is used to interconnect groups of compute nodes, and 10-GigE or even GigE is used to connect the groups of nodes to each other (a compromise to reduce cost).
Storage and file systems

- In small clusters, disks in the master/login node provide primary shared storage. Compute nodes may have disks for scratch space.
- In larger clusters, a separate storage area network (SAN) is used to provide storage to the cluster.
- Usually a distributed file system (DFS) is used to make the storage network appear transparently as a disk or disks to the cluster nodes.
- Currently Lustre is a popular DFS option; others include NFS, GPFS, and FhGFS.
- Desired goal: provide concurrent, high-speed access to applications executing on multiple nodes.
NFS

- NFS stands for Network File System.
- Developed by Sun Microsystems in the early 1980s.
- Open-source implementations exist for most systems.
- Still in wide use; NFS v4 is the current standard, with performance and security enhancements over previous versions.
GPFS

- This is IBM's General Parallel File System.
- Used on some computers in the Top500 list and in many commercial clusters.
- First appeared in the late 1990s.
- Distributed metadata: there is no single metadata controller, eliminating a potential bottleneck.
- Depends on RAID for redundancy and protection from loss of data.
Lustre

- Open source; the name is derived from Linux and cluster.
- Used by Titan and 5 others of the top 10 computers in the Top500 list.
- The Lustre system has three main components:
  1 An MDS (metadata server) and associated MDTs (metadata targets; one per Lustre file system)
  2 One or more OSSes (object storage servers) that interact with OSTs (object storage targets: disks, SAN, etc.)
  3 Clients: cluster nodes, workstations, archival storage systems, etc.
- Designed for high availability and scalability.
Lustre

Image source: http://users.nccs.gov/~fwang2/papers/lustre_report.pdf
HPC software stack

One vendor's software stack diagram:
HPC software

Operating system
- Most clusters today run some version of Linux; Red Hat and CentOS (both RPM-based) are most popular.
- Some vendors (e.g., Cray) have customized versions of Linux.

Cluster management and control
- provision compute nodes
- schedule jobs

HPC development tools
- Compilers
- Debuggers and profiling tools
- MPI libraries and runtime system
The need for diskless provisioning

- Original Beowulf clusters consisted of individual, stand-alone computers connected by a network.
- Each node has a disk holding the OS and other software.
- Our workstation cluster follows this model.
- This model is untenable for medium or large clusters, however, since:
  - each node would have to be installed individually
  - software upgrades would be a huge headache
- The solution is to configure the nodes when they boot using a centralized system image.
PXE

- The PXE (Preboot eXecution Environment) system uses DHCP and TFTP (Trivial File Transfer Protocol) to assign a network address and distribute an OS image and RAM disk to a node when it boots.
- Nodes are not required to have disks (but may have them, for scratch work).
- Only one OS image and RAM disk need be maintained for each type of node.
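As a concrete illustration, the DHCP and TFTP services that PXE needs can both be provided on the master node by a single dnsmasq instance. The fragment below is a minimal sketch only; the interface name, address range, and paths are hypothetical and would differ on a real cluster:

```
# /etc/dnsmasq.conf (sketch)
interface=eth1                         # cluster-facing network interface
dhcp-range=10.1.0.10,10.1.0.250,12h    # address pool handed to compute nodes
dhcp-boot=pxelinux.0                   # boot loader the nodes fetch first
enable-tftp                            # serve boot files over TFTP
tftp-root=/srv/tftpboot                # holds pxelinux.0, the kernel, and the RAM disk
```

Each node's firmware broadcasts a DHCP request at power-on, receives an address plus the name of the boot loader, fetches it over TFTP, and then pulls down the kernel and RAM-disk image, so the node runs entirely from the centrally maintained image.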
Resource management

- A cluster resource management system provides much of the same functionality for the cluster that the OS provides for an individual system.
- The most important resources in a cluster are the compute nodes.
- Nodes may not all be equivalent: some may have more memory, a scratch disk, one or more accelerators (GPU, Xeon Phi), and/or share a faster interconnect with certain other nodes.
- The resource management system is responsible for controlling the allocation of resources to jobs on the cluster.
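To make the allocation step concrete, here is a minimal Python sketch of matching a job's requirements against a set of non-equivalent nodes. This is not any real resource manager's API; the node attributes and field names are invented for illustration:

```python
def matching_nodes(nodes, req):
    """Return the nodes whose attributes satisfy a job's resource request."""
    return [n for n in nodes
            if n["free_cores"] >= req.get("cores", 1)      # enough idle cores
            and n["mem_gb"] >= req.get("mem_gb", 0)        # enough memory
            and (not req.get("gpu") or n.get("gpu", False))]  # GPU if required

nodes = [
    {"name": "n001", "free_cores": 16, "mem_gb": 64,  "gpu": False},
    {"name": "n002", "free_cores": 8,  "mem_gb": 128, "gpu": True},
]

# A job needing 8 cores and a GPU can only go to n002:
print([n["name"] for n in matching_nodes(nodes, {"cores": 8, "gpu": True})])
# → ['n002']
```

A real resource manager tracks many more attributes (interconnect topology, licenses, node health) and updates them as jobs start and finish, but the core operation is this kind of constraint filter.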
Job scheduler

- The job scheduler uses information supplied by the resource manager to determine the best match between job requirements and available resources.
- It then provides this information to the resource manager, which starts jobs as the necessary resources become available.
- Multiple scheduling algorithms exist, including:
  - FCFS: first come, first served
  - FIFO: first in, first out
  - RR: round robin
  - SJF: shortest job first
  - LJF: longest job first
- The algorithm chosen reflects the desired scheduling policy.
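As one example, shortest job first (SJF) can be sketched with a priority queue keyed on each job's estimated runtime. This is an illustrative Python sketch, not a production scheduler; the job names and runtimes are made up:

```python
import heapq

def sjf_order(jobs):
    """Return job names in shortest-job-first order, given estimated runtimes."""
    heap = [(runtime, name) for name, runtime in jobs.items()]
    heapq.heapify(heap)  # min-heap keyed on estimated runtime
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

jobs = {"sim_a": 120, "quick_test": 5, "render": 45}  # runtimes in minutes
print(sjf_order(jobs))
# → ['quick_test', 'render', 'sim_a']
```

LJF is the same sketch with the sign of the key flipped, while FCFS/FIFO would key on submission time instead of runtime.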
Fair share

- Job schedulers often adjust rigid scheduling decisions based on usage history.
- For example, during daytime hours an SJF policy may be enforced, giving preference to jobs with quick turn-around time.
- Suppose Susan keeps submitting jobs that take 10 minutes to run, but Bob needs to run a 15-minute job. Using strict SJF, Susan's jobs will always run before Bob's.
- If the scheduler keeps track of the number of jobs run by each user, it will eventually decide that Susan has had more than her fair share of the cluster nodes, and Bob's job will be run.