NVMe over Fabrics - High Performance Flash moves to Ethernet

Rob Davis & Idan Burstein, Mellanox Technologies

Why NVMe and NVMe over Fabrics (NVMf)?

Compute/Storage Disaggregation: [Diagram: an application resource pool and a disaggregated storage pool connected over 25/40/50/100GbE using NVMeoF.]

Hyperconverged with NVMf

What Makes NVMe Faster: [Diagram: replacing the legacy storage stack with NVMe directly over PCIe cuts access latency from roughly 80usec to roughly 30usec.]

NVMe Performance: 2x-4x more bandwidth, 50-60% lower latency, up to 5x more IOPS.

NVMf is the Logical and Historical Next Step
- Sharing NVMe-based storage across multiple servers/CPUs
- Better utilization of capacity, rack space, and power
- Scalability, management, and fault isolation
- The NVMf 1.0 standard was completed in early June 2016

How Does NVMf Maintain Performance?
- The idea is to extend the efficiency of the local NVMe interface over a fabric (Ethernet or InfiniBand)
- NVMe commands and data structures are transferred end to end
- Relies on RDMA, bypassing TCP/IP, to match DAS performance

Why Not the Traditional TCP/IP Network Stack? [Chart: access time in microseconds for HDD, NVM, and PM media versus the share of total latency spent in the network and networked-storage protocol; as the media gets faster, an FC or TCP software stack comes to dominate access time, while RDMA (and RDMA plus a storage-protocol offload) keeps the network share small.]

What is RDMA? A transport implemented in the network adapter. [Diagram: hosts connected over a 100GbE network with RoCE adapters.]

Early Pre-standard Demonstrations: April 2015, NAB Las Vegas. 10Gb/s reads, 8Gb/s writes; 2.5M random-read 4KB IOPs; latency within ~8usec of local.

Micron FMS 2015 Demonstration: 100GbE RoCE.

Pre-standard Drivers Converge to V1.0

NVMf Standard 1.0: Community Open Source Driver Development

Performance with Community Driver
- Two compute nodes, each with a ConnectX4-LX 25Gbps port
- One storage node with a ConnectX4-LX 50Gbps port and 4 x Intel NVMe devices (P3700/750 series)
- Nodes connected through a switch; added fabric latency ~12us
- Workload: BS = 4KB, 16 jobs, IO depth = 64
- Results: 5.2GB/sec bandwidth (target), 1.3M IOPS (target), 4 online cores at 50% utilization each

Kernel- and User-Space NVMeoF: SPDK

Some NVMeoF Demos at FMS & IDF 2016
Flash Memory Summit: E8 Storage; Mangstor (with initiators from VMs on VMware ESXi); Micron (Windows & Linux initiators to a Linux target); Newisis (Sanmina); Pavilion Data (in the Seagate booth)
Intel Developer Forum: E8 Storage; HGST (WD) NVMeoF on InfiniBand; Intel (NVMe over Fabrics with SPDK); Mellanox 100GbE NICs; Newisis (Sanmina); Samsung; Seagate

The RDMA transport is built on simple primitives that have been deployed in the industry for 15 years:
- Queue Pair (QP): the RDMA communication end point
- Connect: establishes the connection between the two end points
- Memory region registration (REG_MR): enables network access to virtual memory
- SEND and RCV: reliable two-sided messaging
- RDMA READ and RDMA WRITE: reliable one-sided memory-to-memory transfers
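
For concreteness, here is a minimal sketch (not the presenters' code) of these primitives using libibverbs; error handling is trimmed, and the device index, queue sizes, and access flags are illustrative assumptions.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    struct ibv_context *ctx = ibv_open_device(devs[0]);  /* first RNIC */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);               /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

    /* REG_MR: make a virtual buffer accessible to the network.
     * lkey is used in local work requests; rkey is handed to the
     * peer to allow one-sided RDMA READ/WRITE into this buffer. */
    void *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);

    /* QP: the RDMA communication end point (a send and a receive queue). */
    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                  /* reliable connection */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

    /* "Connect": QP numbers, GIDs, and rkeys are exchanged out of band
     * (e.g. via librdmacm) before the QPs are moved to connected state. */

    ibv_destroy_qp(qp);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```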

SEND/RCV: [Sequence diagram: the receiving host posts an RCV; the sending host posts a SEND, which its HCA segments into SEND First/Middle/Last packets; the receiving HCA acknowledges, producing an RCV completion on the receiver and a SEND completion on the sender.]
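
A sketch of this two-sided flow (illustrative; the qp, cq, mr, and buf arguments would come from a setup like the previous listing):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* The receive must be posted before the peer's SEND arrives. */
void sketch_send_recv(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };

    /* Receiver side: post a buffer for the incoming SEND. */
    struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_r;
    ibv_post_recv(qp, &rwr, &bad_r);

    /* Sender side: post the SEND; the HCA segments it into
     * SEND First/Middle/Last packets and retransmits until ACKed. */
    struct ibv_send_wr swr = {
        .wr_id = 2, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad_s;
    ibv_post_send(qp, &swr, &bad_s);

    /* Both sides then reap the RCV/SEND completion from their CQ. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy poll; real code checks wc.status == IBV_WC_SUCCESS */
}
```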

RDMA WRITE: [Sequence diagram: the requester's HCA writes directly into the remote host's registered memory; the remote CPU is not involved.]

RDMA READ: [Sequence diagram: the remote host registers a memory region; the requester posts an RDMA READ; the remote HCA returns READ RESPONSE packets and the requester receives an RDMA READ completion, again without involving the remote CPU.]
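
Both one-sided operations reduce to a single work request; a sketch (illustrative; remote_addr and rkey must have been obtained from the peer out of band):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

void sketch_one_sided(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                      uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,  /* IBV_WR_RDMA_READ pulls instead */
        .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;  /* peer's registered buffer */
    wr.wr.rdma.rkey = rkey;                /* key from the peer's REG_MR */

    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
    /* The completion is reported only to the requester; the remote
     * CPU never runs, exactly as the diagrams show. */
}
```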

NVMe and NVMeoF Fit Together Well: [Diagram: the NVMe queueing model on each host, bridged by the network.]

NVMe-OF IO WRITE
Host: post a SEND carrying the Command Capsule (CC).
Subsystem (target):
- Upon RCV completion: allocate memory for the data, register it to the RNIC, and post an RDMA READ to fetch the data from the host
- Upon READ completion: post the NVMe command to the backing store and wait for its completion
- Upon NVMe completion: post a SEND with the NVMe-OF Response Capsule (RC) and free the allocated memory
- Upon SEND completion: free the CC and completion resources, the receive buffer, and the send buffer
[Sequence diagram: Send Command Capsule -> Ack -> RDMA Read -> Read response first/last -> Send Response Capsule -> Ack.]
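
A sketch of this target-side state machine; every helper name (register_to_rnic, post_rdma_read, post_nvme_command, post_send_response_capsule, free_capsule_resources) is hypothetical, for illustration only:

```c
#include <stdint.h>
#include <stdlib.h>

struct capsule {                /* command capsule, as received */
    uint64_t remote_addr;       /* host buffer holding the WRITE data */
    uint32_t rkey;
    uint32_t len;
};

/* Hypothetical helpers standing in for the RNIC and NVMe layers. */
void register_to_rnic(void *p, uint32_t len);
void post_rdma_read(void *p, uint32_t len, uint64_t raddr, uint32_t rkey);
void post_nvme_command(struct capsule *cc, void *data);
void post_send_response_capsule(struct capsule *cc);
void free_capsule_resources(struct capsule *cc);

void on_recv_completion(struct capsule *cc)
{
    void *data = malloc(cc->len);              /* allocate memory for data */
    register_to_rnic(data, cc->len);
    post_rdma_read(data, cc->len,              /* fetch data from the host */
                   cc->remote_addr, cc->rkey);
}

void on_read_completion(struct capsule *cc, void *data)
{
    post_nvme_command(cc, data);               /* hand off to backing store */
}

void on_nvme_completion(struct capsule *cc, void *data)
{
    post_send_response_capsule(cc);            /* NVMe-OF RC back to host */
    free(data);                                /* free allocated memory */
}

void on_send_completion(struct capsule *cc)
{
    free_capsule_resources(cc);                /* CC, receive, send buffers */
}
```

The READ flow on the next slide is the mirror image: the target posts the NVMe command first and delivers the data with an RDMA WRITE before sending the response capsule.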

NVMe-OF IO READ
Host: post a SEND carrying the Command Capsule (CC).
Subsystem (target):
- Upon RCV completion: allocate memory for the data and post the NVMe command to the backing store
- Upon NVMe completion: post an RDMA WRITE to write the data back to the host, then send the NVMe RC
- Upon SEND completion: free the allocated memory, the CC and completion resources, the receive buffer, and the send buffer
[Sequence diagram: Send Command Capsule -> Ack -> Write first/last -> Ack -> Send Response Capsule -> Ack.]

NVMe-OF IO WRITE, In-Capsule
Host: post a SEND carrying the Command Capsule (CC), with the data carried inside the capsule.
Subsystem (target):
- Upon RCV completion: allocate memory for the data; since the data arrived with the command, no RDMA READ is needed, so post the NVMe command to the backing store and wait for its completion
- Upon NVMe completion: send the NVMe-OF RC and free the memory
- Upon SEND completion: free the CC and completion resources, the receive buffer, and the send buffer

E2E Flow Control
- Receive-resource credits are transported from the RDMA responder to the RDMA requester, piggybacked on the acknowledgments
- The requester limits its transmission to the number of outstanding receive WQEs
- Benefits: optimizes the amount of outstanding buffers for a single connection independently of the application's flow control, and optimizes use of the fabric
[Sequence diagram: sends are acknowledged with a credit count, e.g. Ack(Credits=2); when credits run out, further sends are limited until the responder posts new receives.]
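
The requester-side rule can be sketched in a few lines (illustrative bookkeeping, not a real driver's implementation):

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_ctl {
    uint32_t credits;     /* receive WQEs the responder has reported */
    uint32_t in_flight;   /* sends posted but not yet acknowledged */
};

/* Never have more messages in flight than the peer has receives for. */
bool can_post_send(const struct flow_ctl *fc)
{
    return fc->in_flight < fc->credits;   /* otherwise the SQ is "Limited" */
}

void on_ack(struct flow_ctl *fc, uint32_t credits_in_ack)
{
    fc->in_flight--;
    fc->credits = credits_in_ack;         /* piggybacked credit update */
}
```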

Memory Consumption
- Staging buffer: an intermediate buffer allocated for data fetched from the host/subsystem; allocated on demand, as a function of the parallelism of the backing store
- Receive buffer: with RC QPs, allocated according to the maximum parallelism of a single queue; scales linearly with the number of connections

Shared Receive Queue (SRQ)
- Shares receive buffering resources between QPs, sized according to the parallelism of the application
- E2E credits are managed by RNR NACK, with a timeout associated with the application latency
- The number of connections is the same as RC without SRQ
[Diagram: multiple RQs consuming WQEs and memory buffers from a single shared pool.]
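
In verbs, the pooling looks roughly like this (illustrative sizes):

```c
#include <infiniband/verbs.h>

/* One SRQ feeds many QPs, so receive buffers are pooled
 * instead of being allocated per connection. */
struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = 1024,   /* pooled receive WQEs */
                  .max_sge = 1 },
    };
    return ibv_create_srq(pd, &attr);
}

/* Each QP then points at the shared queue instead of a private RQ. */
void attach_qp_to_srq(struct ibv_qp_init_attr *qpa, struct ibv_srq *srq)
{
    qpa->srq = srq;
}
```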

Extended Reliable Connection (XRC) Transport
- XRC Initiator: transport-level reliability through a single XRC TGT QP, capable of sending messages to any of the XRC SRQs in the target
- XRC Target: transport-level reliability through a single XRC INI QP, spreading the work across the XRC SRQs according to the packet
[Diagram: XRC SRQs feeding the application receive buffers.]

CMB (Controller Memory Buffer) Introduction
- Internal memory of the NVMe device, exposed over PCIe
- A few MB are enough to buffer the PCIe bandwidth for the latency of the NVMe device: with latency ~100-200usec and bandwidth ~25-50Gb/s, a capacity of ~2.5MB suffices
- An enabler for peer-to-peer transfer of data and commands between an RDMA-capable NIC and the NVMe device
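
As a sketch of how the CMB is discovered (assuming an NVMe 1.2+ controller and a hypothetical PCI address; CMBLOC and CMBSZ live at BAR0 offsets 0x38 and 0x3C):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device address; requires root and a mappable BAR0. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDONLY);
    volatile uint32_t *bar0 =
        mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

    uint32_t cmbloc = bar0[0x38 / 4];  /* which BAR + offset holds the CMB */
    uint32_t cmbsz  = bar0[0x3C / 4];  /* size and supported uses */

    /* CMBSZ.SZU (bits 11:8) selects the unit (0 = 4KB, 1 = 64KB, ...);
     * CMBSZ.SZ (bits 31:12) is the size in those units. */
    uint64_t unit = 4096ULL << (4 * ((cmbsz >> 8) & 0xF));
    uint64_t size = (uint64_t)(cmbsz >> 12) * unit;
    printf("CMBLOC=0x%08x CMB size=%llu bytes\n",
           cmbloc, (unsigned long long)size);

    munmap((void *)bar0, 4096);
    close(fd);
    return 0;
}
```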

Scalability of Memory Bandwidth Using Peer-Direct and CMB: [Diagram: IO WRITE with Peer-Direct, NVMf target offload, and CMB; commands and data flow from the 10/25/50/100GbE NIC through the PCIe switch directly to the NVMe devices, bypassing the compute node's memory.]

PCIe 4.0: Accelerating CPU/Memory/Interconnect Performance
- PCIe 1.0 (2003-2004): 2.5Gb/s per lane, 8/10 bit encoding
- PCIe 2.0 (2007-2008): 5Gb/s per lane, 8/10 bit encoding
- PCIe 3.0 (2011-2012): 8Gb/s per lane, 128/130 bit encoding
- PCIe 4.0 (2015-2016): 16Gb/s per lane, 128/130 bit encoding
PCIe 4.0 new capabilities: higher bandwidth (16 to 25Gb/s per lane), cache coherency, atomic operations, advanced power management, memory management
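
As a back-of-the-envelope check (not from the slide): at 16Gb/s per lane with 128/130 bit encoding, one PCIe 4.0 lane carries 16 x 128/130 / 8 ≈ 1.97GB/s, so an x16 link moves roughly 31.5GB/s per direction, double PCIe 3.0's ~15.8GB/s.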

Conclusions: future storage solutions will be able to deliver DAS storage performance over a network.
- NVMe: the new NVMe protocol eliminates HDD-legacy bottlenecks
- Fast networks: faster storage needs faster networks!
- NVMf with RDMA: the new NVMf protocol running over RDMA comes within microseconds of DAS

Thanks! robd@mellanox.com idanb@mellanox.com