Comet: High-performance virtual clusters to support the long tail of science
Philip M. Papadopoulos, Ph.D.
San Diego Supercomputer Center
California Institute for Telecommunications and Information Technology (Calit2)
University of California, San Diego
HPC for the 99%
Comet is funded by the U.S. National Science Foundation to:
- expand the use of high-end resources to a much larger and more diverse community
- support the entire spectrum of NSF communities
- promote a more comprehensive and balanced portfolio
- include research communities that are not users of traditional HPC systems
The long tail of science needs HPC.
[Figure: Jobs and SUs at various scales across NSF resources, 2012. Cumulative fraction of all jobs charged (left axis) and millions of XD SUs charged (right axis) vs. job size in cores.]
99% of jobs run on NSF's HPC resources in 2012 used < 2,048 cores, and those jobs consumed ~50% of the total core-hours across NSF resources.
Comet Will Serve the 99%
Comet: System Characteristics
- Available 1Q 2015; total ~1.9 PF (AVX2)
- Dell primary integrator; Aeon storage vendor; Mellanox FDR InfiniBand
- Intel next-generation processors (formerly codenamed Haswell) with AVX2
- Standard compute nodes: dual 12-core Haswell processors, 128 GB DDR4 DRAM (64 GB/socket!), 320 GB SSD (local scratch, VMs)
- GPU nodes: four NVIDIA GPUs per node
- Large-memory nodes (2Q 2015): 1.5 TB DRAM, four Haswell processors per node
- Hybrid fat-tree topology: FDR (56 Gbps) InfiniBand, rack-level (72 nodes) full bisection bandwidth, 4:1 oversubscription inter-rack
- Performance storage: 7 PB, 200 GB/s (scratch & persistent storage)
- Durable storage (reliability): 6 PB, 100 GB/s
- Gateway hosting nodes and VM image repository
- 100 Gbps external connectivity to Internet2 & ESnet
Comet Architecture
[Diagram: system architecture]
- 18 racks of 72 Haswell standard compute nodes, each node with 320 GB of node-local SSD; plus GPU and large-memory nodes
- 7x 36-port FDR switches in each rack, wired as a full fat-tree; 4:1 oversubscription between racks (18 FDR uplinks per rack) into the FDR IB core switches (2x)
- Performance storage: 7 PB, 200 GB/s (mid-tier); durable storage: 6 PB, 100 GB/s
- IB-to-Ethernet bridges (4x) and Arista 40GbE switches (2x) connect the core to storage, data movers (4x), and the R&E network; a Juniper 100 Gbps router provides access to Internet2
- Additional support components (not shown for clarity): NFS servers, virtual image repository, gateway/portal hosting nodes, login nodes, Ethernet management network, Rocks management nodes
Design Decision: Network
- Each rack: 72 nodes (144 CPUs, 1,728 cores), fully connected with FDR InfiniBand (2-level Clos topology), 144 in-rack cables
- 4:1 oversubscription between racks: 18 inter-rack cables against 72 node links per rack
- Supports the large majority of jobs with no performance degradation
- 3-level network for the complete cluster; worst-case latency similar to that of a much smaller cluster
- Reduced cost
SSDs: building on Gordon's success
Based on our experiences with Gordon, a number of applications will benefit from continued access to flash:
- Applications that generate large numbers of temporary files
- Computational finance: analysis of multiple markets (NASDAQ, etc.)
- Text analytics: word correlations in Google Ngram data
- Computational chemistry codes that write one- and two-electron integral files to scratch
- Structural mechanics codes (e.g., Abaqus), which generate stiffness matrices that don't fit into memory
Large-memory nodes
While most user applications will run well on the standard compute nodes, a few domains will benefit from the large-memory (1.5 TB) nodes:
- De novo genome assembly: ALLPATHS-LG, SOAPdenovo, Velvet
- Finite-element calculations: Abaqus
- Visualization of large data sets
GPU nodes
Comet's GPU nodes will serve a number of domains:
- Molecular dynamics applications have been one of the biggest GPU success stories; packages include Amber, CHARMM, GROMACS, and NAMD
- Applications that depend heavily on linear algebra
- Image and signal processing
Key Comet Strategies
- Target modest-scale users and new users/communities: goal of 10,000 users/year!
- Support capacity computing, with a system optimized for small/modest-scale jobs and quicker resource response using allocation/scheduling policies
- Build upon and expand efforts with science gateways, encouraging gateway usage and hosting via software and operating policies
- Provide a virtualized environment to support development of customized software stacks, virtual environments, and project control of workspaces
Comet will serve a large number of users, including new communities/disciplines
- Allocation/scheduling policies to optimize for high throughput of many modest-scale jobs (leveraging Trestles experience)
- Optimized for rack-level jobs, but cross-rack jobs are feasible
- Optimized for throughput (à la Trestles)
- Per-project allocation caps to ensure large numbers of users
- Rapid access for start-ups, with one-day account generation
- Limits on job sizes, with the possibility of exceptions
- Gateway-friendly environment: science gateways reach large communities with easy user access; e.g., the CIPRES gateway alone currently accounts for ~25% of all users of NSF resources, with 3,000 new users/year and ~5,000 users/year
- Virtualization provides low barriers to entry (see later charts)
Changing the face of XSEDE HPC users
System design and policies:
- Allocations, scheduling, and security policies that favor gateways
- Support for gateway middleware and gateway hosting machines
- Customized environments with high-performance virtualization
- Flexible allocations for bursty usage patterns
- Shared-node runs for small jobs; user-settable reservations
- Third-party apps
Leverage and augment investments elsewhere:
- FutureGrid experience: image packaging, training, on-ramp
- XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions)
- Build off established successes supporting new communities
- Example-based documentation in Comet focus areas
- Unique HPC University contributions to enable community growth
Virtualization Environment
- Leverages the expertise of the Indiana University / FutureGrid team
- VM jobs are scheduled just like batch jobs (not a conventional cloud environment with immediate elastic access)
- VMs will be an easy on-ramp for new users/communities, including low porting time
- Flexible software environments for new communities and apps
- VM repository/library
- Virtual HPC clusters (multi-node) with near-native InfiniBand latency and minimal overhead (SR-IOV)
Single Root I/O Virtualization in HPC
- Problem: complex workflows demand increasing flexibility from HPC platforms
  - Pro: virtualization provides flexibility
  - Con: virtualization costs I/O performance (e.g., excessive DMA interrupts)
- Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs
  - One physical function (PF) presents multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts
  - Allows DMA to bypass the hypervisor and go directly to the VMs
More on Single Root I/O Virtualization
- PCIe is the I/O bus of modern x86 servers; the I/O controller is integrated into the microprocessor
- The I/O complex is called single-root if there is only one controller for the bus
- I/O virtualization allows a single I/O device (e.g., a network or disk controller) to appear as multiple independent I/O devices
- These are called virtual functions; each virtual function can be independently controlled (see the sketch below)
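To make the virtual-function idea concrete, here is a small, hedged sketch (not part of the Comet tooling) that lists SR-IOV-capable PCI devices on a Linux host by reading the standard sriov_totalvfs/sriov_numvfs sysfs attributes; it assumes a reasonably recent kernel and driver stack that exposes those attributes.

```python
# Hypothetical sketch: enumerate SR-IOV-capable PCI devices and their
# virtual functions via the standard Linux sysfs attributes.
# Assumes a kernel/driver that exposes sriov_totalvfs and sriov_numvfs.
import glob
import os

def list_sriov_devices():
    """Yield (pci_address, total_vfs, enabled_vfs) for SR-IOV-capable devices."""
    for path in glob.glob("/sys/bus/pci/devices/*/sriov_totalvfs"):
        dev_dir = os.path.dirname(path)
        pci_addr = os.path.basename(dev_dir)
        with open(path) as f:
            total = int(f.read().strip())
        with open(os.path.join(dev_dir, "sriov_numvfs")) as f:
            enabled = int(f.read().strip())
        yield pci_addr, total, enabled

if __name__ == "__main__":
    for pci_addr, total, enabled in list_sriov_devices():
        print(f"{pci_addr}: {enabled}/{total} virtual functions enabled")
```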
Benchmark comparisons: SR-IOV cluster vs. AWS (pre-Haswell)
Hardware/software configuration:
- Native / SR-IOV cluster: Rocks 6.1 (EL6), virtualization via KVM; 2x Xeon E5-2660 (2.2 GHz), 16 cores per node; 64 GB DDR3 DRAM; QDR 4X InfiniBand, Mellanox ConnectX-3 (MT27500); Intel VT-d and SR-IOV enabled in firmware, kernel, and drivers; mlx4_core 1.1, Mellanox OFED 2.0, HCA firmware 2.11.1192
- Amazon EC2: Amazon Linux 2013.03 (EL6), cc2.8xlarge instances; 2x Xeon E5-2670 (2.6 GHz), 16 cores per node; 60.5 GB DDR3 DRAM; 10 GbE, common placement group
SR-IOV latency approaches native hardware
- SR-IOV: < 30% overhead for messages < 128 bytes; < 10% overhead for eager send/recv; ~0% overhead in the bandwidth-limited regime
- Amazon EC2: > 5000% worse latency, and time-dependent (noisy)
(OSU Microbenchmarks 3.9, osu_latency)
Bandwidth unimpaired: native vs. SR-IOV
- SR-IOV: < 2% bandwidth loss over the entire message-size range; > 95% of peak bandwidth
- Amazon EC2: < 35% of peak bandwidth; 900% to 2500% worse bandwidth than virtualized InfiniBand
(OSU Microbenchmarks 3.9, osu_bw)
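For readers who want to reproduce the kinds of percentages quoted on these two slides, the arithmetic is sketched below; the numeric inputs are placeholder values for illustration, not the actual OSU measurements.

```python
# Illustrative arithmetic for overhead/peak-fraction figures like those above.
# The sample values are placeholders, not the measured OSU results.

def overhead_pct(virtualized: float, native: float) -> float:
    """Relative overhead of a virtualized measurement vs. native (lower is better)."""
    return 100.0 * (virtualized - native) / native

def fraction_of_peak_pct(measured_bw: float, peak_bw: float) -> float:
    """Measured bandwidth as a percentage of native peak (higher is better)."""
    return 100.0 * measured_bw / peak_bw

# Example: 1.10 us small-message latency under SR-IOV vs. 1.00 us native
print(f"latency overhead: {overhead_pct(1.10, 1.00):.0f}%")        # -> 10%
# Example: 3.9 GB/s through SR-IOV vs. a 4.0 GB/s native peak
print(f"bandwidth: {fraction_of_peak_pct(3.9, 4.0):.0f}% of peak")  # -> 98%
```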
Weather modeling: 15% overhead
- 96-core (6-node) calculation; nearest-neighbor communication; scalable algorithms
- SR-IOV incurs a modest (15%) performance hit... but is still 20% faster* than Amazon EC2
- WRF 3.4.1, 3-hour forecast
* 20% faster despite the SR-IOV cluster having 20% slower CPUs
Quantum ESPRESSO: 28% overhead
- 48-core (3-node) calculation; CG matrix inversion (irregular communication); 3D FFT matrix transposes (all-to-all communication)
- 28% slower with SR-IOV, but SR-IOV is still > 500% faster* than EC2
- Quantum ESPRESSO 5.0.2, DEISA AUSURF112 benchmark
* despite the SR-IOV cluster having 20% slower CPUs
If bandwidth is unimpaired, why the falloff in application performance?
- Latency. The measured microbenchmark latency shows 10-30% overhead, but this is variable and can be as much as 100%.
- Why? Hypervisor/physical-node scheduling. We expect advances in software to improve this.
SR-IOV is a huge step forward in high-performance virtualization
- Shows substantial improvement in latency over Amazon EC2 and provides nearly zero bandwidth overhead
- Benchmark application performance confirms significant improvement over EC2
- SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable
- Comet will deliver virtualized HPC to new/non-traditional communities that need flexibility, without major loss of performance
High-Performance Virtualization on Comet
- Mellanox FDR InfiniBand HCAs with SR-IOV
- Rocks and KVM to manage virtual machines and clusters
- Flexibility to support complex science gateways and web-based workflow engines
- Custom compute appliances and virtual clusters developed with FutureGrid, building on their existing expertise
Virtual Clusters: overlay the physical cluster with user-owned high-performance clusters
[Diagram: Virtual Cluster 1 and Virtual Cluster 2 overlaid on the physical cluster]
Virtual Cluster Characteristics
- User-owned and defined: looks like a bare-metal cluster to the user
- Low overhead (latency and bandwidth) for a virtualized InfiniBand interface, via Single Root I/O Virtualization
- Schedule compatibility with standard HPC batch jobs: a node of a virtual cluster is a virtual machine, and that VM looks like one element of a parallel job to the scheduler
- Persistence of the disk state of virtual nodes across multiple boot sequences
Some Interesting Logistics for Virtual Clusters
- Scheduling: can we co-schedule virtual cluster nodes with regular HPC jobs?
- How do we efficiently handle the disk images that make up the nodes of the virtual cluster?
Managing the Physical Infrastructure
Virtual Cluster Anatomy
- A virtual frontend container plus generic virtual compute nodes
- Private network segmentation: 10 GbE using VLANs, InfiniBand using PKEYs (each virtual cluster gets its own VLAN/PKEY pair)
- The virtual frontend also connects to the public network
[Diagram: two virtual clusters, each with a virtual frontend and virtual compute nodes, isolated by distinct VLAN/PKEY assignments on the private 10 GbE and InfiniBand networks]
VM Disk Management
- Each VM gets a 36 GB disk (small SCSI)
- Disk images are persistent through reboots
- Two central NASes store all disk images
- VMs can be allocated on all compute nodes, dependent on availability (scheduler)
- Two solutions: iSCSI (network-mounted disk), or disk replication on the nodes
VM Disk Management: iSCSI NAS
- This is what OpenStack supports
- Each virtual compute node (e.g., compute-x) mounts an iSCSI target on the NAS (e.g., iqn.2001-04.com.nas-0-0-vm-compute-x)
- Big issue: bandwidth bottleneck at the NAS
A hybrid solution via replication
- Initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
- During normal operation, Comet moves a node disk to the physical host that is running the node VM, and then disconnects from the NAS
  - All node-disk operation is local to the physical host
  - Fundamentally enables scale-out without a $1M NAS
- At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot
1.a Init Disk
- An iSCSI mount on the NAS enables the virtual compute node to boot immediately
- Read operations come from the NAS; write operations go to local disk
- Meanwhile, the disk is replicated from the NAS to the physical host
[Diagram: virtual compute-x on a compute node mounting iSCSI target iqn.2001-04.com.nas-0-0-vm-compute-x on the NAS]
1.b Init Disk
- During boot, the disk image on the NAS is migrated to the physical host
- Read-only and read/write layers are then merged into one local disk
- The iSCSI mount is disconnected
2. Steady State
- During normal operation, the node disk is snapshotted
- Incremental snapshots are sent back to the NAS (replicated back to the NAS); see the sketch below
- Timing/load/experiment will tell us how often we can do this
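A hedged sketch of what this steady-state step could look like with ZFS is shown below: it snapshots the node disk's dataset on the hosting node and streams only the increment to the NAS with zfs send/receive. The dataset names, snapshot labels, and NAS hostname are hypothetical, and the actual img-storage roll described two slides ahead implements its own version of this.

```python
# Hypothetical sketch of the steady-state step: take an incremental ZFS
# snapshot of a VM node disk on the physical hosting node and replicate it
# to the NAS. Dataset, host, and snapshot names are invented for illustration.
import subprocess
import time

def replicate_increment(dataset="tank/vm-compute-x",
                        prev_snap="sync-0001",
                        nas_host="nas-0-0",
                        nas_dataset="vol/vm-compute-x"):
    new_snap = f"sync-{int(time.time())}"
    # 1. Snapshot the live node disk on the hosting node.
    subprocess.run(["zfs", "snapshot", f"{dataset}@{new_snap}"], check=True)
    # 2. Stream only the changes since the previous snapshot to the NAS.
    send = subprocess.Popen(
        ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"],
        stdout=subprocess.PIPE)
    subprocess.run(["ssh", nas_host, "zfs", "receive", "-F", nas_dataset],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()
    return new_snap
```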
3. Release Disk
- At shutdown, any unsynched changes are sent back to the NAS
- When the last snapshot has been sent, the virtual compute node can be rebooted on another system
Software to implement this is under development
- Packaged as a Rocks Roll so that it can be part of any physical Rocks-defined cluster: https://github.com/rocksclusters/imgstorage-roll
- Uses ZFS for disk image storage on the NAS and hosting nodes: http://zfsonlinux.org
- RabbitMQ, an AMQP (Advanced Message Queuing Protocol) broker: http://www.rabbitmq.com/
- Pika, a Python library for communicating with RabbitMQ (see the sketch below)
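As a flavor of how Pika is used, here is a minimal, hypothetical example of publishing a disk-sync request to RabbitMQ; the queue name, broker hostname, and message format are invented for illustration and are not the img-storage roll's actual protocol.

```python
# Minimal Pika/RabbitMQ sketch: a hosting node asks for a node disk to be
# synced back to the NAS. Queue name, broker host, and message schema are
# hypothetical, not the actual img-storage protocol.
import json
import pika

def request_sync(vm_name, broker_host="rabbitmq.local"):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=broker_host))
    channel = connection.channel()
    channel.queue_declare(queue="img_storage_sync", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="img_storage_sync",
        body=json.dumps({"action": "sync_to_nas", "vm": vm_name}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()

request_sync("vm-compute-x")
```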
Full virtualization isn't the only choice: containers
- Comet supports fully virtualized clusters; the cluster OS can be almost anything: Windows, Linux, Solaris (not Mac OS)
- Containers are a different way to virtualize the file system and some other elements (network, inter-process communication)
- In Solaris for more than a decade; newly popular in Linux with the Docker project
- Containers must run the same kernel as the host operating system
- Networking and device support are not as flexible (yet)
- Containers have much smaller software footprints
Full vs. Container Virtualization
- Full virtualization (KVM): virtual hosts run on the physical host's kernel and hardware but have independent kernels/OSes; the virtual hardware is universal, and memory/CPU are defined by the virtual hardware definition
- Container virtualization (Docker): containers inherit the host kernel (with a partial /proc) and elements of its hardware, reaching the physical network through a bridge; cgroups are needed to limit container memory/CPU usage (see the sketch below)
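To illustrate the cgroups point, the sketch below uses the Docker SDK for Python (an assumption for illustration, not something the Comet stack necessarily uses) to start a container with memory and CPU caps; the image name and limit values are arbitrary examples.

```python
# Illustration of cgroup-backed resource limits on a container, using the
# Docker SDK for Python. Image and limit values are arbitrary examples.
import docker

client = docker.from_env()
container = client.containers.run(
    "centos:7",          # base image (example)
    "sleep 3600",
    detach=True,
    mem_limit="4g",      # cgroup memory cap
    cpu_period=100000,   # with cpu_quota below, caps the container at 4 cores
    cpu_quota=400000,
)
print(container.id)
```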
Still need to configure virtual systems
Why Docker (container) virtualization is popular:
- Very space efficient if the changes to the base OS file system are small: changes can be hundreds of megabytes instead of gigabytes
- Network (latency) performance and/or network topology is less important (topology is needed for virtual clusters); given how quickly (and how lightweight) Docker brings up virtual environments, this will be addressed
A system is DEFINED by the contents of its file system: system libraries, application code, configuration.
Docker and full virtualization both need to be configured (no free lunch).
Running: almost the same
[Screenshots: a Rocks-created server running in a Docker container (note the uptime, processor, and creation date) alongside a Rocks-created server running fully virtualized under KVM (note the number of CPUs and uptime)]
Summary
- Comet: HPC for the 99%
- Expanded capability to enable virtual clusters; not generic cloud computing
- Advances in virtualization techniques continue
- Comet will support fully virtualized clusters
- Container-based virtualization (Docker) has become quite popular