SR-IOV: Performance Benefits for Virtualized Interconnects




SR-IOV: Performance Benefits for Virtualized Interconnects
Glenn K. Lockwood, Mahidhar Tatineni, Rick Wagner
July 15, 2014, XSEDE14, Atlanta

Background
- High Performance Computing (HPC) is reaching beyond traditional application areas, creating a need for increased flexibility from HPC cyberinfrastructure.
- Increasingly, jobs on XSEDE resources originate from web portals and science gateways.
- Science gateways and user communities can develop virtual "compute appliances" containing tightly integrated application stacks that can be deployed in a hardware-agnostic fashion.
- Several vendors also package software in appliances to enable easy deployment.
- Such benefits have been an important driving force behind compute clouds such as Amazon Web Services (AWS) EC2.

Background
- The network bandwidth and latency of virtualized systems has traditionally been markedly worse than that of native hardware.
- Hardware vendors have been adding virtualization support in hardware, such as I/O memory management units (IOMMUs).
- With the standardization and adoption of technologies such as Single-Root I/O Virtualization (SR-IOV) in network device hardware, a road has been paved toward truly high-performance virtualization for high-performance computing applications.
- These technologies will be available to XSEDE users via future computing resources (Comet at SDSC, with production starting in early 2015).

Single Root I/O Virtualization in HPC
- Problem: complex workflows demand increasing flexibility from HPC platforms.
  - Virtualization = flexibility
  - Virtualization = I/O performance loss (e.g., excessive DMA interrupts)
- Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs.
  - One physical function (PF) is exposed as multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts.
  - Allows DMA to bypass the hypervisor and go directly to the VMs.
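
The slides do not show configuration steps, but as a rough sketch of what "one PF, many VFs" looks like on a Linux host, the code below writes a VF count to the kernel's standard sriov_numvfs sysfs attribute. The PCI address and VF count are hypothetical; ConnectX-3 deployments of this era often used the mlx4_core num_vfs module parameter instead.

```c
/* Minimal sketch: enable SR-IOV VFs through the standard Linux sysfs
 * interface. The PCI address and VF count below are illustrative
 * assumptions, not values taken from the slides. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Hypothetical PCI address of the ConnectX-3 physical function. */
    const char *attr =
        "/sys/bus/pci/devices/0000:21:00.0/sriov_numvfs";
    int num_vfs = 8;   /* illustrative VF count */

    FILE *f = fopen(attr, "w");
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    /* Writing N to sriov_numvfs asks the driver to create N virtual
     * functions, each of which can then be passed through to a KVM guest. */
    fprintf(f, "%d\n", num_vfs);
    fclose(f);
    return EXIT_SUCCESS;
}
```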

High-Performance Virtualization on Comet
- Mellanox FDR InfiniBand HCAs with SR-IOV.
- Rocks to manage high-performance virtual clusters.
- Flexibility to support complex science gateways and web-based workflow engines.
- Custom, user-defined application stacks on virtual clusters.
- Leveraging FutureGrid expertise and experience.
- High-bandwidth filesystem access via virtualized InfiniBand.

Hardware/Software Configurations of Test Clusters
- Test clusters: Native InfiniBand (SDSC), SR-IOV InfiniBand (SDSC), Native 10GbE (SDSC), software-virtualized 10GbE (EC2, cc2.8xlarge instances), and SR-IOV 10GbE (EC2, c3.8xlarge instances).
- Platform: Rocks 6.1 (EL6) on the SDSC clusters; Amazon Linux 2013.09 (EL6) on EC2.
- Hypervisor: KVM for SR-IOV InfiniBand (SDSC); Xen HVM for both EC2 configurations; none for the native clusters.
- CPUs: Intel Xeon E5-2660 (2.2 GHz), 16 cores/node (SDSC); Intel Xeon E5-2670 (2.6 GHz), 16 cores/node (EC2 cc2.8xlarge); Intel Xeon E5-2680 v2 (2.8 GHz), 16 cores/node (EC2 c3.8xlarge).
- RAM: 64 GB DDR3 (SDSC); 60.5 GB DDR3 (EC2).
- Interconnect: QDR 4X InfiniBand, Mellanox ConnectX-3 (SDSC InfiniBand clusters); 10GbE (native SDSC); 10GbE with Xen driver (EC2 software-virtualized); 10GbE with Intel VF driver (EC2 SR-IOV).

Benchmarks
- Fundamental performance characteristics of the interconnect evaluated using the OSU Micro-Benchmarks latency, unidirectional-bandwidth, and bidirectional-bandwidth tests (see the sketch after this list).
- WRF: widely used weather modeling application run in both research and operational forecasting. The CONUS-12km benchmark was used for performance evaluation.
- Quantum ESPRESSO: application that performs density functional theory (DFT) calculations for condensed-matter problems. The DEISA AUSURF112 benchmark was used.
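
For readers unfamiliar with the OSU tests, the minimal MPI ping-pong sketch below (not from the original slides; message size and iteration count are illustrative assumptions) mirrors what osu_latency measures: half the round-trip time for a message exchanged between two ranks.

```c
/* Ping-pong latency sketch in the spirit of osu_latency.
 * Message size and iteration count are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000, msg_size = 8;   /* bytes, illustrative */
    char buf[8];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```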

Benchmark Build Details
- OSU Micro-Benchmarks (OMB version 3.9) compiled with OpenMPI 1.5 and GCC 4.4.6.
- Both test applications were built with Intel Composer XE 2013, using the options necessary for the compiler to generate the 256-bit AVX vector instructions available on all of the testing platforms.
- OpenMPI 1.5 was used for both the InfiniBand and 10GbE platform tests.
- Intel's Math Kernel Library (MKL) 11.0 provided all BLAS, LAPACK, ScaLAPACK, and FFTW3 functions where necessary.
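
As a hedged illustration of what AVX code generation means here (the slides do not list the exact compiler options; the flags named below are assumptions), a simple loop such as the one that follows can be auto-vectorized into 256-bit AVX instructions, for example with icc -xAVX or gcc -mavx -O3.

```c
/* Illustrative only: a loop the compiler can auto-vectorize with
 * 256-bit AVX (four doubles per instruction). The compile flags
 * mentioned above are assumptions, not taken from the slides. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* candidate for 256-bit vectorization */
}
```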

SR-IOV with 10GbE*: Latency Results
- MPI point-to-point latency measured with the osu_latency benchmark.
- 12-40% improvement in the virtualized environment when SR-IOV is used.
- Still 2-2.5x slower than the native case, even with SR-IOV.
- SR-IOV provides 3 to 4 times less variation in latency for small message sizes.
Figure: MPI point-to-point latency as measured by the osu_latency benchmark. Error bars are +/- three standard deviations from the mean.
* SR-IOV provided with Amazon's C3 instances.

SR-IOV with 10GbE: Bandwidth Results
- Unidirectional messaging bandwidth never exceeds 500 MB/s (~40% of line speed); native performance is 1.5-2x faster.
- Similar results for bidirectional bandwidth; SR-IOV has very little benefit in both cases.
- SR-IOV helps slightly in collective bandwidth tests (13% for random ring, 17% for natural ring).
- Native total ring bandwidth was more than 2x faster than the SR-IOV virtualized results.
Figure: MPI (a) unidirectional and (b) bidirectional bandwidth for the 10GbE interconnect, as measured by the osu_bw and osu_bibw benchmarks, respectively.

SR-IOV with InfiniBand: Latency
- SR-IOV:
  - < 30% overhead for message sizes below 128 bytes.
  - < 10% overhead for eager send/recv.
  - Overhead approaches 0% in the bandwidth-limited regime.
- Amazon EC2:
  - > 5000% worse latency.
  - Time-dependent (noisy) behavior.
- Virtualized InfiniBand with SR-IOV delivers 50x lower latency than Amazon EC2.
OSU Micro-Benchmarks (3.9, osu_latency).
Figure 5. MPI point-to-point latency measured by osu_latency for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

SR-IOV with InfiniBand: Bandwidth
- SR-IOV:
  - < 2% bandwidth loss over the entire message-size range.
  - > 95% of peak bandwidth.
- Amazon EC2:
  - < 35% of peak bandwidth.
  - 900% to 2500% worse bandwidth than virtualized InfiniBand.
- Virtualized InfiniBand with SR-IOV delivers 10x more bandwidth than Amazon EC2.
OSU Micro-Benchmarks (3.9, osu_bw).
Figure 6. MPI point-to-point bandwidth measured by osu_bw for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

Application Benchmarks: WRF
- WRF CONUS-12km benchmark: 12 km horizontal resolution on a 425 x 300 grid with 35 vertical levels and a 72-second time step.
- Run on six nodes (96 cores) over QDR 4X InfiniBand virtualized with SR-IOV.
- The SR-IOV test cluster has 2.2 GHz Intel Xeon E5-2660 processors.
- The Amazon instances used 2.6 GHz Intel Xeon E5-2670 processors.

Weather Modeling: 15% Overhead
- 96-core (6-node) calculation.
- Nearest-neighbor communication; scalable algorithms.
- SR-IOV incurs a modest (15%) performance hit...
- ...but is still 20% faster*** than Amazon.
- WRF 3.4.1, 3-hour forecast.
*** 20% faster despite the SR-IOV cluster having 20% slower CPUs.

Application Benchmarks: Quantum ESPRESSO
- DEISA AUSURF112 benchmark with Quantum ESPRESSO.
- Matrix diagonalization is done with the conjugate gradient algorithm; communication overhead is larger than in the WRF case.
- 3D Fourier transforms stress the interconnect because they perform multiple matrix transposes in which every MPI rank must send data to, and receive data from, every other MPI rank (see the sketch below).
- These global collectives cause interconnect congestion, and efficient 3D FFTs are limited by the bisection bandwidth of the entire fabric connecting the compute nodes.
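
To make the communication pattern concrete (this sketch is not from the original slides; the per-rank block size is an illustrative assumption), the transpose step of a distributed 3D FFT is typically expressed as an all-to-all exchange, which is exactly the pattern limited by bisection bandwidth:

```c
/* Sketch of the all-to-all exchange behind a distributed 3D FFT
 * transpose. The block size per rank is an illustrative assumption. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int block = 4096;              /* doubles sent to each rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank holds one "pencil" of the grid; a transpose requires it
     * to ship a distinct block to every other rank and receive one back. */
    double *sendbuf = malloc((size_t)block * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)block * nprocs * sizeof(double));
    for (int i = 0; i < block * nprocs; i++)
        sendbuf[i] = (double)rank;       /* placeholder data */

    /* Every rank sends to and receives from every other rank, so the
     * aggregate traffic is bounded by the fabric's bisection bandwidth. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```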

Quantum ESPRESSO: 5x Faster than EC2
- 48-core (3-node) calculation.
- CG matrix inversion (irregular communication).
- 3D FFT matrix transposes (all-to-all communication).
- 28% slower with SR-IOV...
- ...but SR-IOV is still > 500% faster*** than EC2.
- Quantum ESPRESSO 5.0.2, DEISA AUSURF112 benchmark.
*** Faster despite the SR-IOV cluster having 20% slower CPUs.

Conclusions
- SR-IOV is a huge step forward in high-performance virtualization: it shows a substantial latency improvement over Amazon EC2 and has negligible bandwidth overhead.
- Benchmark application performance confirms this, with significant improvement over EC2.
- SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable.
- Comet will deliver virtualized HPC to new and non-traditional communities that need flexibility without a major loss of performance.