RoCE vs. iWARP Competitive Analysis




WHITE PAPER, August 2015

Contents: Executive Summary; RoCE's Advantages over iWARP; Performance and Benchmark Examples; Best Performance for Virtualization; Summary

Executive Summary

Remote Direct Memory Access (RDMA) provides direct access from the memory of one computer to the memory of another without involving either computer's operating system. This technology enables high-throughput, low-latency networking with low CPU utilization, which is especially useful in massively parallel compute clusters.

RDMA over Converged Ethernet (RoCE) is the most commonly used RDMA technology for Ethernet networks and is deployed at scale in some of the largest hyper-scale data centers in the world. RoCE is the only industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem delivering network adapters, and it operates over standard layer 2 and layer 3 Ethernet switches. The RoCE technology is standardized within industry organizations including the IBTA, IEEE, and IETF.

Mellanox Technologies was the first company to implement the new standard, and its ConnectX-3 Pro and ConnectX-4 product families implement a complete offload of the RoCE protocol. These solutions provide wire-speed throughput at up to 100Gb/s and market-leading latency, with the lowest CPU and memory utilization possible. As a result, the ConnectX family of adapters has been deployed in a variety of mission-critical, latency-sensitive data centers.

RoCE's Advantages over iWARP

iWARP is an alternative RDMA offering that is more complex and unable to achieve the same level of performance as RoCE-based solutions. iWARP uses a complex mix of layers, including DDP (Direct Data Placement), a tweak known as MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP. This convoluted architecture is an ill-conceived attempt to fit RDMA into existing software transport frameworks. Unfortunately, this compromise causes iWARP to fail to deliver precisely the three key benefits that RoCE achieves: high throughput, low latency, and low CPU utilization.

In addition to the complexity and performance disadvantages, only a single vendor (Chelsio) supports iWARP on its current products, and the technology has not been well adopted by the market. Intel previously supported iWARP in its 10GbE NIC from 2009, but has not supported it in any of its newer NICs since then. No iWARP support is available at the latest Ethernet speeds of 25, 50, and 100Gb/s.
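Whichever transport carries it, RDMA is exposed to applications through the same OFA verbs interface shown in Figure 1 below. As an illustration of the RDMA model described in the Executive Summary, the following is a minimal sketch using the open-source libibverbs API, not vendor code: it posts a one-sided RDMA write, assuming a reliable-connected queue pair that has already been created and connected, and that the peer's buffer address and remote key were exchanged out of band. The function name post_rdma_write is hypothetical.

```c
/* Minimal sketch: post a one-sided RDMA WRITE through the verbs API.
 * Assumes `qp` is a reliable-connected QP already in the RTS state and
 * `mr` is a registered memory region covering `local_buf`. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local source buffer          */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,               /* key of the registered region */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided operation   */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer memory address   */
    wr.wr.rdma.rkey        = rkey;               /* peer memory key       */

    /* The adapter moves the data directly between registered buffers;
     * neither host's operating system touches the data path. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The work request is executed entirely by the adapter against pre-registered memory, which is where the low CPU utilization comes from; the difference between RoCE and iWARP lies below this common interface.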

iWARP is designed to work over the existing TCP transport and is essentially an attempt to patch up existing LAN/WAN networks. The Ethernet data link delivers best-effort service, relying on the TCP layer to deliver reliable services. The need to support existing IP networks, including wide area networks, requires coverage of a larger set of boundary conditions with respect to congestion handling, scaling, and error handling, causing inefficiency in the hardware offload of RDMA and the associated transport operations. RoCE, on the other hand, is a purpose-built RDMA transport protocol for Ethernet, not a patch to be used on top of existing TCP/IP protocols.

Figure 1. iWARP's complex network layers vs. RoCE's simpler model. (Both stacks place the RDMA application over the verbs interface of the OFA (OpenFabrics Alliance) stack. Below that, iWARP layers RDMAP, DDP, and MPA over TCP, IP, and the Ethernet link layer, while RoCE places the IBTA transport protocol directly over UDP, IP, and the Ethernet link layer; Ethernet management applies to both.)

Because TCP is connection-based, it must use reliable transport. iWARP therefore supports only the reliable connected transport service, which also means that it is not an appropriate platform for multicast. RoCE offers a variety of transport services, including reliable connected, unreliable datagram, and others, and enables user-level multicast capability.

iWARP traffic also cannot be easily managed and optimized in the fabric itself, leading to inefficiency in deployments. It does not provide a way to detect RDMA traffic at or below the transport layer, for example within the fabric itself. Because iWARP shares TCP's port space, flow management is impossible: the port alone cannot identify whether a message carries RDMA or traditional TCP. iWARP shares the protocol number space with legacy TCP traffic, so context (state) is required to determine that a packet is iWARP. Typically, this context may not fit in the NIC's on-chip memory, which results in much more complexity and therefore longer traffic demultiplexing times. The same problem occurs in the switches and routers of the fabric, where no such state is available. In contrast, a packet can be identified as RoCE simply by looking at its UDP destination port field: if the value matches the IANA-assigned port for RoCE, the packet is RoCE (see the illustrative sketch below). This stateless traffic identification allows quick and early demultiplexing of traffic in a converged NIC implementation, and it enables capabilities such as switch or fabric monitoring and access control lists (ACLs) for improved traffic flow analysis and management.

Similarly, because iWARP shares port space with the legacy TCP stack, it also faces challenges integrating with OS stacks. RoCE, on the other hand, offers full OS stack integration. These challenges limit the cost-effectiveness and deployability of iWARP products, especially in comparison to RoCE.

RoCE includes IP and UDP headers in the packet encapsulation, meaning that RoCE can be used across both L2 and L3 networks. This enables layer 3 routing, which brings RDMA to networks with multiple subnets.

Finally, by deploying Soft-RoCE (Figure 2), the software implementation of RoCE, RoCE can be extended to devices that do not natively support RoCE in hardware, enabling greater flexibility in leveraging RoCE's benefits in the data center.
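The stateless identification described above is simple enough to show in a few lines. The sketch below is an assumption-based illustration using Linux packet headers, not NIC or switch firmware: an IPv4 packet is classified as RoCEv2 purely by checking that it carries UDP and that the UDP destination port equals 4791, the IANA-assigned RoCEv2 port. No equivalent single-field test exists for iWARP, whose TCP port number alone cannot distinguish RDMA from ordinary TCP traffic.

```c
/* Sketch: stateless RoCEv2 classification of an IPv4 packet.
 * `pkt` points at the start of the IPv4 header; `len` is the bytes available. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/udp.h>

#define ROCEV2_UDP_PORT 4791   /* IANA-assigned UDP destination port for RoCEv2 */

static bool is_rocev2(const uint8_t *pkt, size_t len)
{
    if (len < sizeof(struct iphdr))
        return false;

    const struct iphdr *ip = (const struct iphdr *)pkt;
    if (ip->protocol != IPPROTO_UDP)
        return false;                       /* not UDP, so not RoCEv2 */

    size_t ihl = (size_t)ip->ihl * 4;       /* IPv4 header length in bytes */
    if (len < ihl + sizeof(struct udphdr))
        return false;

    const struct udphdr *udp = (const struct udphdr *)(pkt + ihl);
    return ntohs(udp->dest) == ROCEV2_UDP_PORT;   /* the port alone decides */
}
```

No connection state is consulted anywhere in this check, which is what makes the same decision cheap for a NIC pipeline, a switch ACL, or a monitoring tap.

Soft-RoCE has a convenient practical property: once an rxe device has been configured on an ordinary Ethernet NIC, it appears to applications through the same verbs device list as a hardware RoCE adapter. The short program below is a minimal sketch under that assumption, not material from this paper: it enumerates the verbs devices on a host and reports whether each one's first port runs over Ethernet, as a RoCE or Soft-RoCE device does, or over InfiniBand.

```c
/* Sketch: list RDMA devices and report the link layer of port 1.
 * A Soft-RoCE (rxe) or hardware RoCE device reports an Ethernet link layer. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)   /* ports are numbered from 1 */
            printf("%-16s link layer: %s\n",
                   ibv_get_device_name(list[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "Ethernet (RoCE or Soft-RoCE)"
                       : "InfiniBand");

        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```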

Figure 2. Soft-RoCE Architecture. (In user space, the application sits above either the Soft-RoCE user driver or the HW-RoCE user driver, alongside the sockets interface. In the kernel, the Soft-RoCE kernel driver runs over the standard NIC driver of an ordinary Ethernet NIC, the HW-RoCE kernel driver runs over a RoCE HCA, and the conventional sockets/TCP/IP path is shown alongside for comparison.)

Performance and Benchmark Examples

Enterprise data center (EDC) latency-sensitive applications such as Hadoop for real-time data analysis are cornerstones of competitiveness for Web 2.0 and Big Data providers. Such platforms can benefit from Mellanox's ConnectX-3 Pro, as its RoCE solution delivers extremely low latencies on Ethernet while scaling to handle millions of messages per second.

A benchmark comparing a messaging application running over 40Gb Ethernet iWARP on Chelsio's T5 against the same application running over RoCE on ConnectX-3 Pro shows that RoCE consistently delivered messages significantly faster than iWARP (Figure 3).

Figure 3. 40Gb Ethernet Latency Benchmark (RDMA latency in μs as a function of message size, Chelsio iWARP vs. ConnectX-3 Pro RoCE)

When measuring ConnectX-3's RoCE latency against Intel's NetEffect NE020 iWARP, the results are even more impressive: at 10GbE, RoCE's latency advantage is substantial at both message sizes measured (Figure 4).

Message Size (Bytes)   NetEffect NE020 iWARP   ConnectX-3 10GbE RoCE   ConnectX-3 40GbE RoCE
B                      7.22μs                  1μs                     .7μs
B                      1.μs                    3.79μs                  2.13μs

Figure 4. Mellanox RoCE and Intel iWARP Latency Benchmark

Figure 5. 40Gb Ethernet Throughput Benchmark (RDMA Write bandwidth in Gb/s as a function of message size, Chelsio iWARP vs. ConnectX-3 Pro RoCE)

Meanwhile, throughput when using RoCE at 40GbE on ConnectX-3 Pro is over 2X higher than when using iWARP on the Chelsio T5 (Figure 5), and several times higher than when using iWARP at 10GbE with Intel (Figure 6).

Figure 6. 10Gb Ethernet Throughput Benchmark (RDMA Write bandwidth in Gb/s as a function of message size, Intel iWARP vs. ConnectX-3 RoCE)

Best Performance for Virtualization

A further advantage of RoCE is its ability to run over SR-IOV, bringing RoCE's superior performance of the lowest latency, lowest CPU utilization, and maximum throughput to a virtualized environment. RoCE has proven it can provide less than 1μs latency between virtual machines while maintaining consistent throughput as the virtual environment scales. Chelsio's iWARP does not run over multiple VMs in SR-IOV, relying instead on TCP for VM-to-VM communication. The difference in latency is astounding (Figure 7).

Figure 7. 40Gb Ethernet Latency from VM to VM (VM-to-VM latency in μs as a function of message size, RoCE over SR-IOV vs. TCP on 40GbE)
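The VM results above rely on SR-IOV, which splits a physical RoCE adapter into virtual functions that can be assigned directly to guests. As a rough, assumption-based sketch rather than Mellanox documentation, the helper below shows the generic Linux mechanism for enabling virtual functions on a physical function through the standard sriov_numvfs sysfs attribute; the interface name and VF count are illustrative, and the resulting VFs would then be passed through to VMs by the hypervisor.

```c
/* Sketch: enable `num_vfs` SR-IOV virtual functions on the PF behind `ifname`
 * by writing the standard Linux sysfs attribute sriov_numvfs.
 * Note: if VFs are already enabled, the kernel requires writing 0 first. */
#include <stdio.h>

int enable_sriov_vfs(const char *ifname, int num_vfs)
{
    char path[256];
    snprintf(path, sizeof(path),
             "/sys/class/net/%s/device/sriov_numvfs", ifname);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                   /* no such interface or no SR-IOV support */

    fprintf(f, "%d\n", num_vfs);     /* e.g. 4 VFs, each assignable to a VM */
    return fclose(f);                /* fclose flushes the write; a nonzero
                                        return means the kernel rejected it */
}
```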

Summary

RoCE simplifies the transport protocol; it bypasses the TCP stack to enable true and scalable RDMA operations, resulting in higher ROI. RoCE is a standard protocol that was built specifically with data center traffic in mind, with consideration paid to latency, performance, and CPU utilization, and it performs especially well in virtualized environments. When a network runs over Ethernet, RoCE provides a superior solution compared to iWARP. For the enterprise data center seeking the ultimate in performance, RoCE is clearly the choice, especially when latency-sensitive applications are involved. Moreover, RoCE is currently deployed in dozens of data centers with up to hundreds of thousands of nodes, while iWARP is virtually non-existent in the field. Simply put, RoCE is the obvious way to deploy RDMA over Ethernet.

Copyright 2015 Mellanox Technologies. All rights reserved. Mellanox, the Mellanox logo, and ConnectX are registered trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.