RDMA over Ethernet - A Preliminary Study



Similar documents
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand

Can High-Performance Interconnects Benefit Memcached and Hadoop?

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Performance Evaluation of the RDMA over Ethernet (RoCE) Standard in Enterprise Data Centers Infrastructure. Abstract:

Performance Evaluation of InfiniBand with PCI Express

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

High Speed I/O Server Computing with InfiniBand

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Performance of RDMA-Capable Storage Protocols on Wide-Area Network

LS DYNA Performance Benchmarks and Profiling. January 2009

FLOW-3D Performance Benchmark and Profiling. September 2012

SR-IOV In High Performance Computing

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

ECLIPSE Performance Benchmarks and Profiling. January 2009

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

SMB Direct for SQL Server and Private Cloud

Big Data: Hadoop and Memcached

Advancing Applications Performance With InfiniBand

RoCE vs. iWARP Competitive Analysis

Hari Subramoni. Education: Employment: Research Interests: Projects:

InfiniBand Software and Protocols Enable Seamless Off-the-shelf Applications Deployment

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003

Michael Kagan.

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie

Storage at a Distance; Using RoCE as a WAN Transport

A Tour of the Linux OpenFabrics Stack

Interconnect Analysis: 10GigE and InfiniBand in High Performance Computing

Lustre Networking BY PETER J. BRAAM

Mellanox Cloud and Database Acceleration Solution over Windows Server 2012 SMB Direct

Evaluation of ConnectX Virtual Protocol Interconnect for Data Centers

Ultra Low Latency Data Center Switches and iWARP Network Interface Cards

SR-IOV: Performance Benefits for Virtualized Interconnects!

Cluster Grid Interconnects. Tony Kay Chief Architect Enterprise Grid and Networking

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vSphere 4 build

Benchmarking Amazon EC2 for High-Performance Scientific Computing

Intel True Scale Fabric Architecture. Enhanced HPC Architecture and Performance

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Evaluation Report: Emulex OCe GbE and OCe GbE Adapter Comparison with Intel X710 10GbE and XL710 40GbE Adapters

High Performance OpenStack Cloud. Eli Karpilovski Cloud Advisory Council Chairman

State of the Art Cloud Infrastructure

Deploying 10/40G InfiniBand Applications over the WAN

Why Compromise? A discussion on RDMA versus Send/Receive and the difference between interconnect and application semantics

I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology

Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster

RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5. Research Note

Windows 8 SMB 2.2 File Sharing Performance

BRIDGING EMC ISILON NAS ON IP TO INFINIBAND NETWORKS WITH MELLANOX SWITCHX

OFA Training Program. Writing Application Programs for RDMA using OFA Software. Author: Rupert Dance Date: 11/15/

Understanding the Significance of Network Performance in End Applications: A Case Study with EtherFabric and InfiniBand

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

SMB Advanced Networking for Fault Tolerance and Performance. Jose Barreto Principal Program Managers Microsoft Corporation

RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system THESIS

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

3G Converged-NICs A Platform for Server I/O to Converged Networks

White Paper Solarflare High-Performance Computing (HPC) Applications

INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool

Accelerating Big Data Processing with Hadoop, Spark and Memcached

IEEE Congestion Management Presentation for IEEE Congestion Management Study Group

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

Oracle Exadata: The World's Fastest Database Machine Exadata Database Machine Architecture

A Flexible Cluster Infrastructure for Systems Research and Software Development

Hadoop on the Gordon Data Intensive Cluster

Accelerating From Cluster to Cloud: Overview of RDMA on Windows HPC. Wenhao Wu Program Manager Windows HPC team

Building Enterprise-Class Storage Using 40GbE

Running Native Lustre* Client inside Intel Xeon Phi coprocessor

Configuring VMware vSphere 5.1 with Oracle ZFS Storage Appliance and Oracle Fabric Interconnect

HPC Update: Engagement Model

PADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute

High Throughput File Servers with SMB Direct, Using the 3 Flavors of RDMA network adapters

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

NFS SERVER WITH 10 GIGABIT ETHERNET NETWORKS

Enabling Technologies for Distributed Computing

Performance of Software Switching

PERFORMANCE CONSIDERATIONS FOR NETWORK SWITCH FABRICS ON LINUX CLUSTERS

Interoperability Testing and iWARP Performance. Whitepaper

Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck

Low Latency 10 GbE Switching for Data Center, Cluster and Storage Interconnect

The following InfiniBand products based on Mellanox technology are available for the HP BladeSystem c-class from HP:

Performance Evaluation of Amazon EC2 for NASA HPC Applications!

Performance Comparison of ISV Simulation Codes on Microsoft Windows HPC Server 2008 and SUSE Linux Enterprise Server 10.2

Research Report: The Arista 7124FX Switch as a High Performance Trade Execution Platform

Advanced Computer Networks. High Performance Networking I

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers

Transcription:

RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar K. Panda Computer Science & Engineering Department The Ohio State University

Outline: Introduction, Problem Statement, Approach, Performance Evaluation and Results, Conclusions and Future Work

Introduction: Ethernet and InfiniBand account for the majority of interconnects in high performance distributed computing. End users want InfiniBand-like latencies over their existing Ethernet infrastructure, which can be achieved if the two networks converge. Existing options carry overhead or performance trade-offs, and no solution efficiently combines the ubiquity of Ethernet with the high performance of InfiniBand. RDMA over Ethernet (RDMAoE) appears to be a good option at present.

RDMAoE: Allows running the IB transport protocol over Ethernet frames. RDMAoE packets are standard Ethernet frames carrying an IEEE-assigned Ethertype, a GRH, and unmodified IB transport headers and payload. The InfiniBand HCA translates between InfiniBand and Ethernet addresses: it encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack. Connections are established using GIDs instead of LIDs. There is no SM/SA; standard Ethernet management practices are used.
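
As an illustration of this addressing model, the sketch below (not part of the original study; it assumes the first local RDMA device, port 1, GID index 0, and a libibverbs/OFED release with RDMAoE/RoCE support) queries a port and prints its link layer, LID and GID. In RDMAoE mode the link layer is reported as Ethernet, the LID is unused, and the GID carries the IP-derived address.

```c
/* Sketch: inspect a ConnectX port to see whether it runs in IB or RDMAoE
 * (Ethernet) mode, and read the GID used for addressing. Assumes the first
 * device and port 1; error handling is abbreviated. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }

    struct ibv_port_attr port;
    union ibv_gid gid;
    ibv_query_port(ctx, 1, &port);
    ibv_query_gid(ctx, 1, 0, &gid);   /* GID at index 0 */

    /* Under RDMAoE the link layer is Ethernet, there is no SM-assigned LID,
     * and the GID encodes the address derived from the host IP stack. */
    printf("link layer: %s\n",
           port.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet" : "InfiniBand");
    printf("LID: %u (0 on Ethernet links)\n", (unsigned)port.lid);
    printf("GID: prefix %llx, interface id %llx\n",
           (unsigned long long)gid.global.subnet_prefix,
           (unsigned long long)gid.global.interface_id);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```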

InfiniBand Architecture & Adapters: An industry standard for low-latency, high-bandwidth System Area Networks with multiple features: two communication types, channel semantics and memory semantics (the RDMA mechanism); multiple virtual lanes; and Quality of Service (QoS) support. Double Data Rate (DDR) with 20 Gbps bandwidth has been available for some time, and Quad Data Rate (QDR) with 40 Gbps bandwidth has become available recently. Multiple generations of InfiniBand adapters are available now; the latest ConnectX DDR adapters support both IB and RDMAoE modes.
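
The two communication types map directly onto verbs operations. The fragment below is an illustrative sketch, not code from the study: it assumes a connected queue pair, a registered buffer, and an out-of-band exchange of the peer's remote address and rkey, none of which is shown.

```c
/* Channel vs. memory semantics with OpenFabrics verbs -- illustrative sketch. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Channel semantics: a SEND consumes a RECV posted by the peer. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id = 1, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Memory semantics: an RDMA WRITE places data directly into remote registered
 * memory, with no receive posted and no CPU involvement at the target. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                           size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id = 2, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;   /* peer buffer address */
    wr.wr.rdma.rkey        = rkey;          /* peer memory region key */
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```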

Modes of Communication using ConnectX DDR Adapter. Applications can run over one of four stacks on the same adapter:

  Network protocol   Application protocol             Adapter    Switch
  TCP/IP             TCP/IP                           ConnectX   Ethernet switch
  IPoIB              IPoIB                            ConnectX   IB switch
  Native IB          Verbs (with RDMA)                ConnectX   IB switch
  RDMAoE             Open Fabrics Verbs (with RDMA)   ConnectX   Ethernet switch

Outline: Introduction, Problem Statement, Approach, Performance Evaluation and Results, Conclusions and Future Work

Problem Statement: How do the different communication protocols stack up against each other in terms of raw sockets / verbs-level performance, performance for MPI applications, and performance for data center applications? Does RDMAoE bring us a step closer to the goal of network convergence?

Outline: Introduction, Problem Statement, Approach, Performance Evaluation and Results, Conclusions and Future Work

Approach: Protocol-level benchmarks to evaluate raw performance; MPI-level benchmarks to evaluate basic MPI performance at both point-to-point and collective levels; application-level benchmarks to evaluate the performance of real-world applications; and evaluation using common data center applications.

Outline: Introduction, Problem Statement, Approach, Performance Evaluation and Results, Conclusions and Future Work

Experimental Testbed. Compute platform: Intel Nehalem, dual quad-core Intel Xeon E5530 processors operating at 2.40 GHz, 12 GB RAM, 8 MB cache, PCIe 2.0 interface. Host Channel Adapter: dual-port ConnectX DDR adapter, configured in either RDMAoE mode or IB mode. Network switches: 24-port Mellanox IB DDR switch and 24-port Fulcrum FocalPoint 10GigE switch. OFED version: OFED-1.4.1 for IB and IPoIB; pre-release version of OFED-1.5 for RDMAoE and TCP/IP. MPI versions: MVAPICH-1.1 and MPICH-1.2.7p1.

MVAPICH / MVAPICH2 Software: a high performance MPI library for IB and 10GE, available as MVAPICH (MPI-1) and MVAPICH2 (MPI-2). Used by more than 960 organizations in 51 countries, with more than 32,000 downloads directly from the OSU site. Empowering many TOP500 clusters, including the 8th-ranked 62,976-core cluster (Ranger) at TACC. Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED). Also supports a uDAPL device to work with any network supporting uDAPL. http://mvapich.cse.ohio-state.edu/

List of Benchmarks: OSU Microbenchmarks (OMB) version 3.1.1, http://mvapich.cse.ohio-state.edu/benchmarks/; Intel Collective Microbenchmarks (IMB) version 3.2, http://software.intel.com/en-us/articles/intel-mpi-benchmarks/; NAS Parallel Benchmarks (NPB) version 3.3, http://www.nas.nasa.gov/

Verbs Level Evaluation, Inter-Node Latency. [Figure: latency (us) vs. message size (bytes) for Native IB, RDMAoE, TCP/IP and IPoIB, shown for small and large messages.] For small messages, Native IB verbs offers the best latency at 1.66 us; RDMAoE comes very close at 3.03 us, while TCP/IP and IPoIB are in the 25.5-29.9 us range.

MPI Level Evaluation, Inter-Node Latency. [Figure: latency (us) vs. message size (bytes) for Native IB, RDMAoE, TCP/IP and IPoIB, shown for small and large messages.] For small messages, Native IB offers the best latency at 1.8 us; RDMAoE comes very close at 3.6 us, while TCP/IP and IPoIB are in the 45.4-47.9 us range.
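
The MPI-level latency above is measured with a ping-pong pattern similar to the OSU latency test. A minimal, self-contained sketch (not the actual OMB code; message size and iteration count are illustrative) looks like this:

```c
/* Minimal MPI ping-pong latency sketch between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { SIZE = 1, ITERS = 10000 };
    char buf[SIZE] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency is half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```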

Inter-Node Multipair Latency. [Figure: latency (us) vs. message size (bytes) for Native IB, RDMAoE, TCP/IP and IPoIB, shown for small and large messages.] Four pairs of processes communicate simultaneously. For small messages, Native IB offers the best latency at 1.66 us; RDMAoE comes very close at 3.51 us, while TCP/IP and IPoIB are in the 39.4-64.7 us range.

Collective Performance, Allgather Latency (32 cores). [Figure: latency (us) vs. message size (bytes) for Native IB, RDMAoE, TCP/IP and IPoIB, shown for small and large messages.] For small messages, Native IB offers the best latency at 12.97 us; RDMAoE comes very close at 22.71 us, while TCP/IP and IPoIB are in the 280.7-330.6 us range.

Collective Performance, Allreduce Latency (32 cores). [Figure: latency (us) vs. message size (bytes) for Native IB, RDMAoE, TCP/IP and IPoIB, shown for small and large messages.] For small messages, Native IB offers the best latency at 10.05 us; RDMAoE comes very close at 12.78 us, while TCP/IP and IPoIB are in the 285.4-329.1 us range.
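
The collective latencies above are obtained by timing repeated collective operations, in the spirit of the IMB tests. A minimal MPI_Allreduce timing sketch (not the actual IMB code; message size and iteration count are illustrative) follows:

```c
/* Sketch of timing a small-message MPI_Allreduce across all ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { COUNT = 4, ITERS = 1000 };   /* 16-byte message with MPI_INT */
    int in[COUNT] = {0}, out[COUNT];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double elapsed = (MPI_Wtime() - t0) / ITERS;

    double max_elapsed;   /* report the slowest rank, as collective benchmarks do */
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("avg Allreduce latency: %.2f us\n", max_elapsed * 1e6);

    MPI_Finalize();
    return 0;
}
```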

Performance of NAS Benchmarks. [Figure: execution time for LU, CG, EP, FT and MG, normalized to Native-IB, for Native-IB, RDMAoE, TCP/IP and IPoIB.] 32 processes, Class C; numbers are normalized to Native-IB. The performance of Native IB and RDMAoE is very close, with Native IB giving the best performance.

Evaluation of Data Center Applications. [Figure: file transfer time (s) vs. message size (MB) for Native IB, RDMAoE, TCP/IP and IPoIB.] We evaluate FTP, a common data center application. We use our own version of FTP over native IB verbs (FTP-ADTS [2]) to evaluate RDMAoE and Native IB; GridFTP [1] is used to evaluate the performance of TCP/IP and IPoIB. RDMAoE shows performance comparable to Native IB. [1] http://www.globus.org/grid_software/data/gridftp.php [2] FTP Mechanisms for High Performance Data-Transfer over InfiniBand. Ping Lai, Hari Subramoni, Sundeep Narravula, Amith Mamidala, D. K. Panda. ICPP '09.

Outline: Introduction, Problem Statement, Approach, Performance Evaluation and Results, Conclusions and Future Work

Conclusions & Future Work: We performed a comprehensive evaluation of all possible modes of communication (Native IB, RDMAoE, TCP/IP, IPoIB) using verbs-, MPI-, application- and data-center-level experiments. Native IB gives the best performance, followed by RDMAoE. RDMAoE provides a high performance solution to the problem of network convergence. As part of future work, we plan to perform large scale evaluations, including studies into the effect of network contention on the performance of these protocols, and to study these protocols comprehensively for file systems.

Thank you! {subramon, laipi, luom, panda}@cse.ohio-state.edu Network-Based Computing Laboratory http://mvapich.cse.ohio-state.edu/