Comet - High-performance virtual clusters to support the long tail of science.




Comet - High-performance virtual clusters to support the long tail of science. Philip M. Papadopoulos, Ph.D., San Diego Supercomputer Center; California Institute for Telecommunications and Information Technology (Calit2); University of California, San Diego

HPC for the 99%

Comet is funded by the U.S. National Science Foundation to:
- expand the use of high-end resources to a much larger and more diverse community
- support the entire spectrum of NSF communities
- promote a more comprehensive and balanced portfolio
- include research communities that are not users of traditional HPC systems
The long tail of science needs HPC.

Jobs and SUs at various scales across NSF resources: 99% of jobs run on NSF's HPC resources in 2012 used < 2048 cores, and those jobs consumed ~50% of the total core-hours across NSF resources. [Chart: cumulative percentage of jobs (left axis) and millions of XD SUs charged (right axis) vs. job size in cores, from one node up to 16K cores.]

Comet Will Serve the 99%

Comet: System Characteristics
- Available 1Q 2015; total flops ~1.9 PF (AVX2)
- Dell primary integrator; Intel next-generation (Haswell) processors with AVX2; Aeon storage vendor; Mellanox FDR InfiniBand
- Standard compute nodes: dual 12-core Haswell processors, 128 GB DDR4 DRAM (64 GB/socket!), 320 GB SSD (local scratch, VMs)
- GPU nodes: four NVIDIA GPUs/node
- Large-memory nodes (2Q 2015): 1.5 TB DRAM, four Haswell processors/node
- Hybrid fat-tree topology: FDR (56 Gbps) InfiniBand, rack-level (72 nodes) full bisection bandwidth, 4:1 oversubscription inter-rack
- Performance Storage: 7 PB, 200 GB/s (scratch & persistent storage)
- Durable Storage (reliability): 6 PB, 100 GB/s
- Gateway hosting nodes and VM image repository
- 100 Gbps external connectivity to Internet2 & ESnet

Comet Architecture [Diagram]: N racks of 72 Haswell standard compute nodes, each node with 320 GB of node-local SSD storage, plus GPU racks and 4 large-memory nodes. Seven 36-port FDR switches in each rack are wired as a full fat-tree; 18 FDR uplinks per rack run through mid-tier FDR switches to the IB core switches (2x), giving 4:1 oversubscription between racks. Bridge nodes (4x) connect the IB fabric to Arista 40GbE switches (2x), which serve the Performance Storage (7 PB, 200 GB/s), the Durable Storage (6 PB, 100 GB/s), and the data movers (4x); a Juniper router provides 100 Gbps R&E network access to Internet2. Additional support components (not shown for clarity): NFS servers, virtual image repository, gateway/portal hosting nodes, login nodes, Ethernet management network, Rocks management nodes.

Design Decision - Network
- Each rack: 72 nodes (144 CPUs, 1,728 cores), fully connected FDR IB (2-level Clos topology), 144 in-rack cables
- 4:1 oversubscription between racks via 18 inter-rack cables (a quick check of this ratio is sketched below)
- Supports the large majority of jobs with no performance degradation
- A 3-level network for the complete cluster; worst-case latency similar to that of a much smaller cluster
- Reduced cost
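The oversubscription figure follows directly from the cable counts above. A minimal back-of-the-envelope check in Python, using only the numbers quoted on this slide (not any Comet configuration file):

```python
# Back-of-the-envelope check of the inter-rack oversubscription described above.
NODES_PER_RACK = 72        # one FDR HCA/link per node feeding the rack switches
INTER_RACK_UPLINKS = 18    # FDR cables leaving each rack
FDR_GBPS = 56              # FDR InfiniBand line rate per link

oversubscription = NODES_PER_RACK / INTER_RACK_UPLINKS
print(f"Inter-rack oversubscription: {oversubscription:.0f}:1")          # 4:1
print(f"In-rack bandwidth per node: {FDR_GBPS} Gbps (full fat-tree)")
print(f"Worst-case inter-rack share per node: {FDR_GBPS / oversubscription:.0f} Gbps")
```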

SSDs: building on Gordon success. Based on our experiences with Gordon, a number of applications will benefit from continued access to flash:
- Applications that generate large numbers of temp files
- Computational finance: analysis of multiple markets (NASDAQ, etc.)
- Text analytics: word correlations in Google Ngram data
- Computational chemistry codes that write one- and two-electron integral files to scratch
- Structural mechanics codes (e.g., Abaqus), which generate stiffness matrices that don't fit into memory

Large-memory nodes: while most user applications will run well on the standard compute nodes, a few domains will benefit from the large-memory (1.5 TB) nodes:
- De novo genome assembly: ALLPATHS-LG, SOAPdenovo, Velvet
- Finite-element calculations: Abaqus
- Visualization of large data sets

GPU nodes: Comet's GPU nodes will serve a number of domains:
- Molecular dynamics applications have been one of the biggest GPU success stories; packages include Amber, CHARMM, Gromacs, and NAMD
- Applications that depend heavily on linear algebra
- Image and signal processing

Key Comet Strategies
- Target modest-scale users and new users/communities: goal of 10,000 users/year!
- Support capacity computing, with a system optimized for small/modest-scale jobs and quicker resource response using allocation/scheduling policies
- Build upon and expand efforts with Science Gateways, encouraging gateway usage and hosting via software and operating policies
- Provide a virtualized environment to support development of customized software stacks, virtual environments, and project control of workspaces

Comet will serve a large number of users, including new communities/disciplines
- Allocation/scheduling policies optimize for high throughput of many modest-scale jobs (leveraging Trestles experience): optimized for rack-level jobs, but cross-rack jobs feasible; optimized for throughput (a la Trestles); per-project allocation caps to ensure large numbers of users; rapid access for start-ups with one-day account generation; limits on job sizes, with the possibility of exceptions
- Gateway-friendly environment: science gateways reach large communities with easy user access; e.g., the CIPRES gateway alone currently accounts for ~25% of all users of NSF resources, with 3,000 new users/year and ~5,000 users/year
- Virtualization provides low barriers to entry (see later charts)

Changing the face of XSEDE HPC users
- System design and policies: allocation, scheduling, and security policies that favor gateways; support for gateway middleware and gateway hosting machines; customized environments with high-performance virtualization; flexible allocations for bursty usage patterns; shared-node runs for small jobs; user-settable reservations; third-party apps
- Leverage and augment investments elsewhere: FutureGrid experience, image packaging, training, on-ramp; XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions); build off established successes supporting new communities; example-based documentation in Comet focus areas; unique HPC University contributions to enable community growth

Virtualization Environment
- Leveraging the expertise of the Indiana U/FutureGrid team
- VM jobs scheduled just like batch jobs (not a conventional cloud environment with immediate elastic access)
- VMs will be an easy on-ramp for new users/communities, including low porting time
- Flexible software environments for new communities and apps
- VM repository/library
- Virtual HPC cluster (multi-node) with near-native IB latency and minimal overhead (SR-IOV)

Single Root I/O Virtualization in HPC
- Problem: complex workflows demand increasing flexibility from HPC platforms. Pro: virtualization gives flexibility. Con: virtualization costs I/O performance (e.g., excessive DMA interrupts).
- Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs. One physical function (PF) presents multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts. This allows DMA to bypass the hypervisor and go directly to the VMs.

More on Single Root I/O Virtualization
- PCIe is the I/O bus of modern x86 servers; the I/O controller is integrated into the microprocessor. The I/O complex is called single-root if there is only one controller for the bus.
- I/O virtualization allows a single I/O device (e.g., a network or disk controller) to appear as multiple independent I/O devices, called virtual functions. Each virtual function can be independently controlled.
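As a concrete illustration (a sketch, not part of the Comet software), the standard Linux sysfs interface exposes these virtual functions per PCIe physical function; the snippet below simply lists how many VFs each SR-IOV-capable device has enabled:

```python
# List PCI physical functions that expose SR-IOV virtual functions by reading
# the stock Linux sysfs attributes. Which device shows up (e.g., a Mellanox
# ConnectX-3 HCA) depends entirely on the host.
import glob
import os

for pf_attr in glob.glob("/sys/bus/pci/devices/*/sriov_totalvfs"):
    dev = os.path.dirname(pf_attr)
    with open(pf_attr) as f:
        total_vfs = int(f.read())
    with open(os.path.join(dev, "sriov_numvfs")) as f:
        active_vfs = int(f.read())
    if total_vfs > 0:
        print(f"{os.path.basename(dev)}: {active_vfs}/{total_vfs} VFs enabled")
```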

Benchmark comparisons of the SR-IOV cluster vs. AWS (pre-Haswell). Hardware/software configuration:
- Native, SR-IOV cluster: Rocks 6.1 (EL6) platform, virtualization via KVM; 2x Xeon E5-2660 (2.2 GHz), 16 cores per node; 64 GB DDR3 DRAM; QDR 4X InfiniBand, Mellanox ConnectX-3 (MT27500); Intel VT-d and SR-IOV enabled in firmware, kernel, and drivers; mlx4_core 1.1; Mellanox OFED 2.0; HCA firmware 2.11.1192
- Amazon EC2: Amazon Linux 2013.03 (EL6), cc2.8xlarge instances; 2x Xeon E5-2670 (2.6 GHz), 16 cores per node; 60.5 GB DDR3 DRAM; 10 GbE, common placement group

SR-IOV latency approaches native hardware
- SR-IOV: < 30% overhead for messages < 128 bytes; < 10% overhead for eager send/recv; 0% overhead in the bandwidth-limited regime
- Amazon EC2: > 5000% worse latency, and time dependent (noisy)
(OSU Microbenchmarks 3.9, osu_latency)

Bandwidth unimpaired: native vs. SR-IOV
- SR-IOV: < 2% bandwidth loss over the entire range; > 95% of peak bandwidth
- Amazon EC2: < 35% of peak bandwidth; 900% to 2500% worse bandwidth than virtualized InfiniBand
(OSU Microbenchmarks 3.9, osu_bw)
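For reference, the overhead percentages quoted on these slides are simple relative differences against the bare-metal numbers; a tiny helper (with placeholder inputs, not the measured results) shows the calculation:

```python
# Illustrative only: how an overhead percentage is derived from paired
# OSU microbenchmark results. The inputs below are placeholders.
def percent_overhead(virtualized: float, native: float) -> float:
    """Relative slowdown (or loss) of a virtualized measurement vs. bare metal."""
    return 100.0 * (virtualized - native) / native

native_latency_us = 1.1   # hypothetical osu_latency small-message result, native
sriov_latency_us = 1.4    # hypothetical result inside an SR-IOV guest
print(f"SR-IOV latency overhead: {percent_overhead(sriov_latency_us, native_latency_us):.0f}%")
```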

Weather modeling: 15% overhead
- 96-core (6-node) calculation; nearest-neighbor communication; scalable algorithms
- SR-IOV incurs a modest (15%) performance hit... but is still 20% faster*** than Amazon
- WRF 3.4.1, 3-hour forecast
*** 20% faster despite the SR-IOV cluster having 20% slower CPUs

Quantum ESPRESSO: 28% overhead
- 48-core (3-node) calculation; CG matrix inversion (irregular communication); 3D FFT matrix transposes (all-to-all communication)
- 28% slower with SR-IOV, yet SR-IOV is still > 500% faster*** than EC2
- Quantum ESPRESSO 5.0.2, DEISA AUSURF112 benchmark
*** despite the SR-IOV cluster having 20% slower CPUs

If bandwidth is unimpaired, why the falloff in application performance? Latency. The measured microbenchmark latency shows 10-30% overhead; however, this is variable and can be as much as 100%. Why? Hypervisor/physical-node scheduling. We expect advances in software to improve this.

SR-IOV is a huge step forward in high-performance virtualization
- It shows a substantial improvement in latency over Amazon EC2, and it provides nearly zero bandwidth overhead
- Benchmarked application performance confirms a significant improvement over EC2
- SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable
- Comet will deliver virtualized HPC to new/non-traditional communities that need flexibility, without major loss of performance

High-Performance Virtualization on Comet
- Mellanox FDR InfiniBand HCAs with SR-IOV
- Rocks and KVM to manage virtual machines and clusters
- Flexibility to support complex science gateways and web-based workflow engines
- Custom compute appliances and virtual clusters developed with FutureGrid and their existing expertise

Virtual Clusters: overlay the physical cluster with user-owned, high-performance clusters. [Diagram: the physical cluster hosting Virtual Cluster 1 and Virtual Cluster 2.]

Virtual Cluster Characteristics
- User-owned and defined: looks like a bare-metal cluster to the user
- Low overhead (latency and bandwidth) for a virtualized InfiniBand interface, via Single Root I/O Virtualization
- Schedule compatibility with standard HPC batch jobs: a node of a virtual cluster is a virtual machine, and that VM looks like one element of a parallel job to the scheduler
- Persistence of the disk state of virtual nodes across multiple boot sequences

Some Interesting Logistics for Virtual Clusters
- Scheduling: can we co-schedule virtual cluster nodes with regular HPC jobs?
- How do we efficiently handle the disk images that make up the nodes of the virtual cluster?

Managing the Physical Infrastructure

Virtual Cluster Anatomy: each virtual cluster consists of a virtual frontend (running in a container on a frontend host with access to the public network) and virtual compute nodes on the generic compute nodes. Private network segmentation is done per cluster with a 10 GbE VLAN plus an InfiniBand PKEY (e.g., VLAN MMM / PKEY LLLL for one cluster, VLAN NNN / PKEY JJJJ for another). [Diagram: two virtual clusters, each with a virtual frontend and virtual compute nodes, isolated on the Ethernet and InfiniBand fabrics by their VLAN/PKEY pair.]
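A minimal sketch of how such per-cluster segmentation might be recorded, assuming one VLAN/PKEY pair per virtual cluster (names and values are hypothetical, not Comet's actual configuration):

```python
# Hypothetical bookkeeping for the VLAN + PKEY pair that isolates each
# virtual cluster on the Ethernet and InfiniBand fabrics.
from dataclasses import dataclass

@dataclass
class VirtualClusterNetwork:
    name: str
    vlan_id: int   # 10 GbE private-network VLAN tag
    ib_pkey: int   # InfiniBand partition key (16-bit)

clusters = [
    VirtualClusterNetwork("vc-alpha", vlan_id=2001, ib_pkey=0x8001),
    VirtualClusterNetwork("vc-beta", vlan_id=2002, ib_pkey=0x8002),
]

for vc in clusters:
    print(f"{vc.name}: VLAN {vc.vlan_id}, PKEY {vc.ib_pkey:#06x}")
```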

VM Disk Management
- Each VM gets a 36 GB disk (small SCSI)
- Disk images are persistent through reboots
- Two central NASes store all disk images
- VMs can be allocated on all compute nodes, dependent on availability (scheduler)
- Two solutions: iSCSI (network-mounted disk) or disk replication on the nodes

VM Disk Management: iSCSI to the NAS. This is what OpenStack supports. Big issue: bandwidth bottleneck at the NAS. [Diagram: compute nodes mounting per-VM iSCSI targets from the NAS, e.g., virtual compute-x uses target iqn.2001-04.com.nas-0-0-vm-compute-x.]

A hybrid solution via replication
- Initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
- During normal operation, Comet moves a node disk to the physical host that is running the node VM and then disconnects from the NAS: all node-disk operations are local to the physical host, which fundamentally enables scale-out without a $1M NAS
- At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot

1.a Init Disk: an iSCSI mount from the NAS enables the virtual compute node to boot immediately. Read operations come from the NAS; write operations go to local disk while the disk is replicated. [Diagram: virtual compute-x mounting target iqn.2001-04.com.nas-0-0-vm-compute-x from the NAS.]

1.b Init Disk: during boot, the disk image on the NAS is migrated to the physical host. The read-only and read/write layers are then merged into one local disk, and the iSCSI mount is disconnected. [Diagram: virtual compute-x now backed by local disk only.]

2. Steady State: during normal operation the node disk is snapshotted and incremental snapshots are sent back to the NAS (replicated back to the NAS). Timing, load, and experimentation will tell us how often we can do this. [Diagram: virtual compute-x on its physical host, streaming snapshots to the NAS.]
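Since the node disks are ZFS-backed (see the software list below), the steady-state step amounts to periodic incremental `zfs send`/`zfs receive` from the hosting node back to the NAS. A rough sketch, with hypothetical dataset and host names and no error handling:

```python
# Sketch of incremental node-disk replication from a hosting node to the NAS.
# Dataset/host names are hypothetical; snapshot bookkeeping is simplified.
import subprocess
import time

def replicate_node_disk(dataset="tank/vm-compute-0", nas="nas-0-0", prev_snap=None):
    snap = f"{dataset}@{int(time.time())}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    # Send only the delta since the previous snapshot when one exists.
    send_cmd = ["zfs", "send", "-i", prev_snap, snap] if prev_snap else ["zfs", "send", snap]
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)

    # Stream the snapshot to the NAS over ssh and apply it there.
    subprocess.run(["ssh", nas, "zfs", "receive", "-F", dataset],
                   stdin=sender.stdout, check=True)
    sender.wait()
    return snap   # base snapshot for the next incremental send
```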

3. Release Disk: at shutdown, any unsynced changes are sent back to the NAS. When the last snapshot has been sent, the virtual compute node can be rebooted on another system. [Diagram: virtual compute-x powered off; its node disk resides on the NAS again.]

The software to implement this is under development
- Packaged as a Rocks Roll so that it can be part of any physical Rocks-defined cluster: https://github.com/rocksclusters/imgstorage-roll
- Uses ZFS for disk image storage on the NAS and hosting nodes: http://zfsonlinux.org
- RabbitMQ, an AMQP (Advanced Message Queuing Protocol) broker: http://www.rabbitmq.com/
- Pika: a library for communication with RabbitMQ from Python
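For orientation, a minimal Pika example of the style of messaging used to coordinate hosting nodes and the NAS over RabbitMQ; the queue name and message body here are invented for illustration (see the imgstorage-roll repository for the real protocol):

```python
# Publish a (made-up) disk-sync request to RabbitMQ using Pika.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.local"))
channel = connection.channel()
channel.queue_declare(queue="vm_disk_sync", durable=True)

channel.basic_publish(
    exchange="",
    routing_key="vm_disk_sync",
    body=json.dumps({"node": "vm-compute-0", "action": "replicate_snapshot"}),
)
connection.close()
```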

Full virtualization isn't the only choice: containers
- Comet supports fully virtualized clusters; the OS of the cluster can be almost anything: Windows, Linux, Solaris (not Mac OS)
- Containers are a different way to virtualize the file system and some other elements (network, inter-process communication); they have been in Solaris for more than a decade and are newly popular in Linux with the Docker project
- Containers must run the same kernel as the host operating system; networking and device support are not as flexible (yet)
- Containers have much smaller software footprints

Full vs. Container Virtualization
- Full virtualization (KVM): virtual hosts have independent kernels/OSes; the hardware is universal; memory/CPU are defined by the definition of the virtual hardware. [Diagram: physical host kernel and hardware running virtual hosts attached to the physical network.]
- Container virtualization (Docker): containers inherit the host kernel and elements of its hardware (e.g., a partial /proc and a network bridge); cgroups are needed to limit container memory/CPU usage. [Diagram: physical host kernel running Container 1 and Container 2, bridged to the physical network.]

You still need to configure virtual systems
- Why is Docker (container) virtualization popular? It is very space efficient if the changes to the base OS file system are small: changes can be hundreds of megabytes instead of gigabytes. Network (latency) performance and/or network topology is less important (topology is needed for virtual clusters), and given how quickly (and how lightweight) Docker brings up virtual environments, this will be addressed.
- A system is DEFINED by the contents of the file system: system libraries, application code, configuration. Both Docker and full virtualization need to be configured (no free lunch).
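A small sketch using the Docker SDK for Python makes the kernel-sharing point concrete: the container is defined by its file-system contents, yet `uname -r` inside it reports the host's kernel. The image name is only an example; any Linux base image would do.

```python
# Run a throwaway container and show that it reports the host kernel version.
import docker

client = docker.from_env()
output = client.containers.run("centos:centos6", "uname -r", remove=True)
print("Container sees host kernel:", output.decode().strip())
```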

Running: almost the same. [Screenshots: a Rocks-created server running in a Docker container (note the uptime, processor, and creation date) vs. a Rocks-created server running fully virtualized under KVM (note the number of CPUs and uptime).]

Summary
- Comet: HPC for the 99%
- Expanded capability to enable virtual clusters; not generic cloud computing
- Advances in virtualization techniques continue: Comet will support fully virtualized clusters, and container-based virtualization (Docker) has become quite popular