How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics
John Kim, Mellanox; David Fair, Intel
January 26, 2016

Who We Are
John F. Kim, Director, Storage Marketing, Mellanox Technologies
David Fair, Chair, SNIA-ESF; Ethernet Networking Marketing Manager, Intel Corp.

SNIA Legal Notice
- The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification.
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA Education Committee.
- Neither the author nor the presenter is an attorney, and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Agenda
- How RDMA fabrics fit into NVMe over Fabrics
- RDMA explained and how it benefits NVMe/F
- Verbs, the lingua franca of RDMA
- Varieties of Ethernet RDMA explained
- Deployment considerations for RDMA-enhanced Ethernet

How RDMA Fabrics Fit Into NVMe over Fabrics

We Are Not Covering NVMe over Fabrics Here Today
- For a comprehensive introduction to NVMe/Fabrics, please watch the SNIA-ESF webcast "Under the Hood with NVMe over Fabrics," produced December 2015 by J Metz (Cisco) and Dave Minturn (Intel)
- Posted on the SNIA-ESF website under "Webcasts On Demand": http://www.snia.org/forums/esf/knowledge/webcasts
- We are focusing on how RDMA fits into NVMe/Fabrics
- A detailed understanding of the NVMe/F spec is not required

That Said, NVMe/F Expands NVMe to Fabrics
- Adds message-based NVMe operations
- Leverages the common NVMe architecture with additional definitions
- Allows remote and shared access to NVMe subsystems
- Standardizes NVMe over a range of fabric types
- Initial fabrics: RDMA (RoCE, iWARP, InfiniBand) and Fibre Channel
- First release candidate specification in early 2016
- The NVMe.org Fabrics WG is developing Linux host and target drivers

Why NVMe over Fabrics
- End-to-end NVMe semantics across a range of topologies
- Retains NVMe efficiency and performance over network fabrics
- Eliminates unnecessary protocol translations (e.g., SCSI)
- Enables low-latency, high-IOPS remote NVMe storage solutions

NVMe Queuing Operational Model
[Diagram: the host NVMe driver and the NVM subsystem's NVMe controller exchange entries across a fabric port. SQEs carry a Command Id, OpCode, NSID, PRP/SGL buffer address, and command parameters; CQEs carry the command status, SQ head pointer, phase (P) bit, and Command Id.]
1. The host driver enqueues Submission Queue Entries into the SQ
2. The NVMe controller dequeues Submission Queue Entries
3. The NVMe controller enqueues Completion Queue Entries into the CQ
4. The host driver dequeues Completion Queue Entries
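As a rough illustration of the entries flowing through these queues, here is a simplified C sketch; the field names and widths are approximations of the slide's labels, not the exact NVMe specification layout.

```c
#include <stdint.h>

/* Hedged sketch of the queue entries named on this slide. Fields are
 * illustrative simplifications, not the real 64-byte NVMe SQE layout. */
struct nvme_sqe {                /* Submission Queue Entry */
    uint8_t  opcode;             /* command to execute, e.g. read/write */
    uint16_t command_id;         /* matched against the completion later */
    uint32_t nsid;               /* namespace ID the command targets */
    uint64_t buffer_address;     /* PRP or SGL describing the data buffer */
    /* ... command parameters ... */
};

struct nvme_cqe {                /* Completion Queue Entry */
    uint16_t sq_head_ptr;        /* how far the controller consumed the SQ */
    uint16_t command_status;     /* success or an error code */
    uint8_t  phase;              /* P bit; flips on each queue wraparound */
    uint16_t command_id;         /* identifies which SQE completed */
};

/* The slide's four steps, as pseudocode:
 *   1. host:       sq[tail++] = sqe;  ring the submission doorbell
 *   2. controller: sqe = sq[head++];  execute the command
 *   3. controller: cq[tail++] = cqe;  raise a completion interrupt
 *   4. host:       cqe = cq[head++];  complete the I/O, update doorbell
 */
```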

NVMe over Fabrics Capsules
- NVMe over Fabrics Command Capsule
  - Encapsulates the NVMe Submission Queue Entry (SQE)
  - May contain additional Scatter Gather Lists (SGLs) or NVMe command data
  - Transport-agnostic capsule format (not used in RDMA)
- NVMe over Fabrics Response Capsule
  - Encapsulates the NVMe Completion Queue Entry (CQE)
  - May contain NVMe command data
  - Transport-agnostic capsule format

Fabric Ports
- Subsystem ports are associated with physical fabric ports
- Multiple NVMe controllers may be accessed through a single port
- Each NVMe controller is associated with one port
- Fabric types: PCIe, RDMA (Ethernet RoCE/iWARP, InfiniBand), Fibre Channel/FCoE
[Diagram: a fabric connects to an NVM subsystem through two fabric ports, each fronting NVMe controllers.]

Key Points About NVMe/F
- NVMe was built from the ground up to support a consistent model for NVM interfaces, even across network fabrics
- The host sees networked NVM as if local; NVMe commands and structures are transferred end-to-end
- Maintains the NVMe architecture across a range of fabric types
- Simplicity enables hardware-automated I/O queues and an NVMe transport bridge
- No translation to or from another protocol like SCSI (in firmware/software)
- Separation between control traffic (administration) and data I/O traffic
- Inherent parallelism of NVMe: multiple I/O queues are exposed to the host

RDMA Explained and Why Chosen for NVMe/F

What Is Remote Direct Memory Access (RDMA)?
- RDMA is a host-offload, host-bypass technology that allows an application (including storage) to make data transfers directly to/from another application's memory space
- The RDMA-capable Ethernet NICs (RNICs), not the host, manage reliable connections between source and destination
- Applications communicate with the RDMA NIC using dedicated Queue Pairs (QPs) and Completion Queues (CQs)
  - Each application can have many QPs and CQs
  - Each QP has a Send Queue (SQ) and a Receive Queue (RQ)
  - Each CQ can be associated with multiple SQs or RQs
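To make QPs and CQs concrete, here is a minimal setup sketch against the Linux verbs API (libibverbs). It is an illustration only: the queue depths are arbitrary, error handling is omitted, and device 0 is assumed to be the RNIC.

```c
#include <infiniband/verbs.h>

/* Minimal RDMA resource setup: one CQ shared by a QP's send and
 * receive queues. Queue sizes here are arbitrary illustrations. */
int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);        /* protection domain */

    /* One completion queue; a CQ may serve many SQs and RQs */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* A queue pair = one send queue + one receive queue */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,   /* SQ depth */
            .max_recv_wr  = 128,   /* RQ depth */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,     /* reliable connected transport */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... connect the QP, register memory, post work requests ... */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```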

Benefits of Remote Direct Memory Access
- Bypasses the system software stack components that process network traffic
  - For user applications (outer rails in the diagram), RDMA bypasses the kernel altogether
  - For kernel applications (inner rails), RDMA bypasses the OS stack and the system drivers
- Direct data placement from one machine (real or virtual) to another machine without copies
- Increased bandwidth while lowering latency, jitter, and CPU utilization
- Great for networked storage!
[Diagram: the standard NIC flow traverses the user application, I/O library, OS stack, and system driver before reaching PCIe, transport/network (L4/L3), and Ethernet (L1/L0); the RDMA NIC flow goes directly from the application to the RNIC.]

Details on RDMA Performance Benefits

| RDMA Technique | CPU Util. | Latency | Mem BW |
|----------------|-----------|---------|--------|
| Offload network transport (e.g., TCP/IP) from the host | ✓ | ✓ | |
| Eliminate receive memory copies with tagged buffers | ✓ | | ✓ |
| Reduce context switching with OS bypass (map NIC hardware resources into user space) | | ✓ | |
| Define an asynchronous verbs API (sockets is synchronous) | ✓ | ✓ | |
| Preserve message boundaries to enable application (e.g., SCSI) header/data separation | | | ✓ |
| Message-level (not packet-level) interrupt coalescing | ✓ | ✓ | |

Low NVMe Latency Exposes Network Latencies

| Storage Media Technology | Access Time (microseconds) |
|--------------------------|----------------------------|
| HDD                      | 1000                       |
| SSD                      | 10                         |
| Future NVM               | 0.1                        |

- As storage latency drops, network latency becomes important; for example, a 20-microsecond network round trip adds only 2% to a 1000-microsecond HDD access but triples the effective latency of a 10-microsecond SSD access
- Both the physical network and the network software stack add latency
- CPU interrupts and utilization also matter
- Faster storage requires faster networks

Queues, Capsules, and More Queues: Example of a Host Write to a Remote Target
1. The NVMe host driver encapsulates the NVMe Submission Queue Entry (including data) into a fabric-neutral Command Capsule and passes it to the NVMe RDMA transport
2. Capsules are placed in the host RNIC's RDMA Send Queue and become an RDMA_SEND payload
3. The target RNIC at a fabric port receives the capsule in an RDMA Receive Queue
4. The RNIC places the capsule's SQE and data into target host memory
5. The RNIC signals the RDMA receive completion to the target's NVMe RDMA transport
6. The target processes the NVMe command and data
7. The target encapsulates the NVMe Completion Queue Entry into a fabric-neutral Response Capsule and passes it back to the NVMe RDMA transport
[Diagram: host NVMe Submission and Completion Queues, host RDMA Send and Receive Queues, target RDMA Send and Receive Queues, and target NVMe controller Submission and Completion Queues.]
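A hedged host-side sketch of steps 1 and 2, again using libibverbs. The capsule layout and the send_capsule helper are hypothetical simplifications for illustration, not the NVMe/F wire format.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical, simplified command capsule: a fabric-neutral wrapper
 * around the NVMe SQE, optionally followed by in-capsule write data.
 * The real capsule layout is defined by the NVMe/F specification. */
struct cmd_capsule {
    unsigned char sqe[64];       /* encapsulated NVMe SQE */
    unsigned char data[4096];    /* optional in-capsule data */
};

/* Post one command capsule as an RDMA_SEND payload (step 2 above).
 * qp, mr, and capsule are assumed set up and registered elsewhere. */
static int send_capsule(struct ibv_qp *qp, struct ibv_mr *mr,
                        struct cmd_capsule *capsule)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)capsule,
        .length = sizeof(*capsule),
        .lkey   = mr->lkey,            /* from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)capsule,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,     /* untagged message: lands in a
                                          buffer the target pre-posted
                                          to its RDMA Receive Queue */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```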

NVMe Multi-Queue Host Interface Maps Neatly to the RDMA Queue-Pair Model
- Standard (local) NVMe
  - NVMe Submission and Completion Queues are aligned to CPU cores
  - No inter-CPU software locks
  - Per-CQ MSI-X interrupts enable source-core interrupt steering
- NVMe over an RDMA fabric (one possible RDMA QP mapping)
  - Retains the NVMe SQ/CQ CPU alignment
  - No inter-CPU software locks
  - Source-core interrupt steering is retained by using RDMA Event Queue MSI-X interrupts
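One way the per-core alignment could carry over to RDMA resources is sketched below; the helper is hypothetical, and ctx and pd are assumed to come from the earlier setup sketch.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Hedged sketch: one CQ and one QP per CPU core, mirroring NVMe's
 * per-core SQ/CQ pairs so the fast path takes no cross-core locks. */
struct core_queues {
    struct ibv_cq *cq;
    struct ibv_qp *qp;
};

static struct core_queues *create_per_core_qps(struct ibv_context *ctx,
                                               struct ibv_pd *pd,
                                               int ncores)
{
    struct core_queues *q = calloc(ncores, sizeof(*q));
    for (int i = 0; i < ncores; i++) {
        /* A distinct completion vector per core lets the RNIC steer
         * each CQ's MSI-X interrupt back to the submitting core. */
        int vec = i % ctx->num_comp_vectors;
        q[i].cq = ibv_create_cq(ctx, 256, NULL, NULL, vec);

        struct ibv_qp_init_attr attr = {
            .send_cq = q[i].cq,
            .recv_cq = q[i].cq,
            .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        q[i].qp = ibv_create_qp(pd, &attr);
    }
    return q;
}
```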

Verbs, the Lingua Franca of RDMA

The Application RDMA Programming Model Is Defined by Verbs (IETF draft [1] and InfiniBand spec [2])
- Verbs are the common, standardized basis of the different RDMA system software APIs
- Verbs also provide a behavioral model for RNICs
- Verbs require a new programming model; they are not sockets
- The SMB Direct, iSER, and NFSoRDMA storage protocols take advantage of verbs in system software
  - This makes RDMA transparent to applications
- NVMe/F adopts a similar approach and generates the necessary verbs to drive the fabric
  - No application changes or rewrites required; remote NVMe devices just look local to the host

1. http://tools.ietf.org/html/draft-hilland-rddp-verbs-00
2. https://cw.infinibandta.org/document/dl/7859, Chapter 11

More on Verbs
A few of the most common verbs:
- PostSQ Work Request (WR): transmit data (or a read request) to the remote peer
- PostRQ WR: provide the RDMA NIC with empty buffers to fill with untagged (unsolicited) messages from the remote peer
- Poll for Completion: obtain a Work Completion from the RDMA NIC
  - An SQ WR completes when the RDMA NIC guarantees its reliable delivery to the remote peer
  - An RQ WR completes when its buffer has been filled by a received message
- Request Completion Notification: request an interrupt on issue of a CQ Work Completion
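In libibverbs these four verbs correspond to real calls, sketched below. The QP, CQ, and registered buffer are assumed from the earlier setup example; for brevity the same buffer is posted to both queues, which real code would not do.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch of the four common verbs; illustration only. */
static void verbs_examples(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };

    /* PostRQ WR: hand the RNIC an empty buffer for untagged messages */
    struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_rwr;
    ibv_post_recv(qp, &rwr, &bad_rwr);

    /* PostSQ WR: transmit the buffer to the remote peer */
    struct ibv_send_wr swr = {
        .wr_id = 2, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_swr;
    ibv_post_send(qp, &swr, &bad_swr);

    /* Request Completion Notification: interrupt on the next completion */
    ibv_req_notify_cq(cq, 0);

    /* Poll for Completion: reap Work Completions without blocking */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 1) {
        if (wc.status != IBV_WC_SUCCESS)
            break;  /* wc.wr_id identifies the failed work request */
    }
}
```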

Server OS Support for RDMA Verbs
- Windows Server
  - Network Direct userspace API supported since Windows HPC Server 2008
  - Network Direct Kernel API supported since Windows Server 2012
- Linux
  - Userspace/kernel APIs supported by the OpenFabrics Alliance since 2004
  - Upstream in most popular server distros, including RHEL and SLES
- FreeBSD
  - OpenFabrics userspace/kernel APIs supported since 2011 (FreeBSD 9.0+)

Varieties of Ethernet RDMA Explained

Both iWARP and RoCE Provide Ethernet RDMA Services
- RoCE is based on the InfiniBand transport over Ethernet
  - RoCEv2 enhances RoCE with a UDP header and Internet routability (uses IP but not TCP)
  - RoCEv2 still uses the InfiniBand transport on top of Ethernet
- iWARP is layered on top of TCP/IP
  - Offloaded TCP/IP flow control and management
- Both iWARP and RoCE (and InfiniBand) support verbs
  - NVMe/F, written to verbs, can run on top of either transport

Underlying ISO Stacks of the Flavors of Ethernet RDMA
[Diagram: protocol stacks for each flavor. iWARP layers RDMA over TCP/IP over Ethernet; RoCE layers the InfiniBand transport directly over Ethernet; RoCEv2 layers it over UDP/IP over Ethernet.]

Deployment Considerations for RDMA-Enhanced Ethernet

Compatibility Considerations
- iWARP and RoCE are software-compatible if written to the RDMA verbs
- iWARP and RoCE both require RNICs
- iWARP and RoCE cannot talk RDMA to each other because of L3/L4 differences
  - iWARP adapters can talk RDMA only to other iWARP adapters
  - RoCE adapters can talk RDMA only to other RoCE adapters

Ethernet RDMA Vendor Ecosystem
- RoCE is supported by the IBTA and the RoCE Alliance
  - Adapters from Avago (Emulex) and Mellanox
  - Adapter support promised by QLogic and some startups
- iWARP is supported by Chelsio and Intel
  - Support from Intel in a future server chipset
  - Adapter support promised by QLogic and some startups
- Both RoCE and iWARP run on all major Ethernet switches (Arista, Cisco, Dell, HPE, Mellanox, etc.)

Network Deployment Considerations
- Data Center Bridging (DCB)
  - iWARP can benefit from a lossless DCB fabric but does not require DCB because it uses TCP
  - RoCE and RoCEv2 require a lossless DCB fabric
    - Similar to FCoE requirements, but across the L2 subnet; RoCEv2 is L3-routable
    - Minimum requirement: Priority Flow Control (PFC)
  - All major enterprise switches support DCB
- Congestion management
  - iWARP leverages TCP/IP (e.g., windowing), RFC 3168 ECN, and other IETF standards
  - RoCE can use RoCE Congestion Management, which leverages ECN

Summary
- NVMe/F requires the low network latency that RDMA can provide
  - RDMA reduces latency and improves CPU utilization
- NVMe/F supports RDMA verbs transparently
  - No changes to applications required
- NVMe/F maps NVMe queues to RDMA queue pairs
- RoCE and iWARP are software-compatible (via verbs) but do not interoperate because their transports are different
  - Different vendors and ecosystems
  - Different network infrastructure requirements

For More Information on RDMA-Enabled Ethernet
- For iWARP
  - "iWARP, the Movie": https://www.youtube.com/watch?v=ksxmfzxqmbq
  - Chelsio Communications white papers: http://www.chelsio.com/white-papers/
- For RoCE
  - RoCE Initiative: http://www.roceinitiative.org/
  - InfiniBand Trade Association: http://www.infinibandta.org/
  - Mellanox: http://www.mellanox.com
  - Avago (Emulex): http://www.emulex.com/

After This Webcast
- Please rate this webcast and provide us with feedback
- This webcast and a PDF of the slides will be posted to the SNIA Ethernet Storage Forum (ESF) website and available on demand: http://www.snia.org/forums/esf/knowledge/webcasts
- A full Q&A from this webcast, including answers to questions we couldn't get to today, will be posted to the SNIA-ESF blog: http://sniaesfblog.org/
- Follow us on Twitter @SNIAESF

Thank you!