NVM Express: Bringing PCIe Speed to SSDs
Session D-101
Moderator: Tom Burniece
Presenters: Akber Kazmi, David Akerson, Mike James
Santa Clara, CA USA, October 2013
AGENDA
PCI Express Supersedes SAS and SATA in Storage
- Current PCIe Use Cases
- Future PCIe Applications
NVM Express: Unlock Your Solid State Drive's Potential
- Overview
- Basics
- Status
PCI Express Supersedes SAS and SATA in Storage
Akber Kazmi, PLX Technology
Agenda
- PCIe Roadmap/History
- Quick Overview of PCIe
- Enhancements in PCIe for New Applications
- Vendor-Defined Enhancements
- New Usage Models and Applications
- Summary
SAS / SATA / PCIe Roadmaps (2000-2018)
- PCIe: Gen1 2.5Gb/s -> Gen2 5Gb/s -> Gen3 8Gb/s -> Gen4 16Gb/s
- SATA: 3Gb/s -> 6Gb/s -> SATAe* 8Gb/s -> SATAe 16Gb/s
- SAS: 3Gb/s -> 6Gb/s -> 12Gb/s -> SAS4? 24Gb/s or SASe 16Gb/s
* No faster native SATA; SATA adopted PCI Express (SATAe)
Quick Overview of PCIe Technology
- Replaces the PCI bus with full software compatibility
- Serial, point-to-point, packet-based protocol
- Three generations of evolution: speeds of 2.5, 5.0 and 8.0 GT/s per lane
- Supports a mix of Gen 1, 2 & 3 ports
- Ports can scale from x1 to x32
- Low latency
PCIe Protocol Enhancements
- SR-IOV: sharing of I/O with multiple VMs
- DPC/eDPC: handling of surprise down
- SRIS: clock-less cabling
- PCIe Cable: new spec for low-cost cable
- PCIe Retimer: standardized re-timer
Single Root I/O Virtualization (SR-IOV)
- SR-IOV defines shared devices in PCIe
- Devices offer multiple virtual functions (VFs)
- The VMM assigns one or more VFs to a VM
- Higher performance & lower cost, footprint & power
[Diagram: two servers sharing SR-IOV endpoints through a PCIe switch; each endpoint exposes a physical function (PF) and VFs 1..M assigned to VMs 1..N]
SRIS Clocking
SRIS (Separate Refclk Independent SSC)
- Enables new PCIe applications like cables
- Allows each side of the link to have its own clock
- Two kinds of clocking supported:
  - 600 ppm difference with no SSC (aka SRNS)
  - 5600 ppm difference with independent SSC (aka SRIS)
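As a rough illustration of what these ppm budgets mean (a sketch, not from the slides): a tolerance in parts per million translates directly into a maximum frequency offset between the two link partners. The 100 MHz reference clock below is a typical PCIe value, assumed for the example.

```python
def max_freq_offset_hz(clock_hz: float, ppm: float) -> float:
    """Worst-case frequency offset allowed between the two link partners."""
    return clock_hz * ppm / 1_000_000

# On an assumed 100 MHz PCIe reference clock:
srns_offset = max_freq_offset_hz(100e6, 600)    # SRNS budget: 60 kHz
sris_offset = max_freq_offset_hz(100e6, 5600)   # SRIS budget: 560 kHz
```

The roughly 10x larger SRIS budget is what allows each cable end to run from its own spread-spectrum clock.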
PCIe Cable
- PCIe cable spec (a.k.a. OCuLink)
- Targeted for low-cost optical and copper cables
- Spec will address active, half-active and passive cables
- Scope of definition: electrical, mechanical, form factor, signaling, power, etc.
- Expected completion: 2014
PCIe Re-timers / Repeaters
- A new ECN for re-timers is in review
- Current re-timers don't work well in PCIe Gen3 systems
- Defines the repeater and re-timer function for PCIe Gen3
- Useful in active cable applications
- Will also be used in boards and backplanes
DPC/eDPC
Downstream Port Containment (DPC) prevents a host time-out or blue screen when triggered by an error from an endpoint:
- Takes down the link to the endpoint
- Sends an error message/interrupt to the host
- Replies to read requests within the time-out window (completion synthesis for root ports)
- Host software handles switch responses
[Diagram: host CPU above a PCIe switch with attached I/O and SSD devices]
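A toy sketch of the containment idea (a hypothetical model, not PCI-SIG code): once containment triggers, the port stops forwarding requests to the dead endpoint and instead synthesizes completions, conventionally all-ones, so the host's read returns within the time-out window instead of hanging.

```python
class ContainedPort:
    """Hypothetical model of a downstream port with DPC-style containment."""

    ALL_ONES = 0xFFFF_FFFF  # conventional synthesized-completion value

    def __init__(self, registers: dict):
        self.registers = registers   # fake endpoint register space
        self.contained = False

    def surprise_down(self):
        # DPC triggered: the link is taken down and the host is notified
        # via an error message/interrupt (not modeled here).
        self.contained = True

    def read32(self, offset: int) -> int:
        if self.contained:
            return self.ALL_ONES     # completion synthesized by the port
        return self.registers[offset]

port = ContainedPort({0x0: 0x0001_10B5})
before = port.read32(0x0)   # normal read reaches the endpoint
port.surprise_down()
after = port.read32(0x0)    # synthesized completion, no host hang
```

This is also why the hot-plug material later in the deck warns drivers not to consume synthesized all-ones values as valid data.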
Vendor-Defined PCIe Enhancements
- Sharing an SR-IOV device with multiple hosts
- Host isolation/failover
- Embedded DMA controller
- Host-to-host communication
SSD Adapter Card
NVMe 2.5" form factor (two SSD controllers, ASIC1 and ASIC2):
- Only ASIC1 (controller) is exposed to the host
- ASIC1 services interrupts from ASIC2
Adapter card form factor:
- Two or more SSD controllers aggregated with a PCIe switch
- Host isolation capability
- One ASIC is exposed to the host and services interrupts from the other ASICs
Server Motherboards
- Capacity expansion in server enclosures
- Aggregation through a PCIe switch
[Diagram: server motherboard with CPU, RAM, NIC, and PCIe switch]
FLASH Appliances
- Servers connect over PCIe (copper/optical) to an I/O expansion box through PCIe switches
[Diagram: servers, PCIe over Cu/Opt links, expansion box with switches and I/O modules]
Future PCIe Applications
Expanded Use in Data Centers
Hide Some Endpoints from the Host
- One endpoint is visible to the host; the rest are hidden
- The exposed endpoint communicates with the host and manages the hidden endpoints
[Diagram: PCIe switch with one host-visible endpoint and several hidden I/O or SSD devices]
Flash Appliances with I/O Sharing
- Share SR-IOV SSD modules in an expansion chassis
- Shared by multiple servers/hosts or blade servers
[Diagram: servers, PCIe over Cu/Opt links, expansion box with switches and I/O modules]
Today's Data Center Rack
- Multiple fabrics: Ethernet, Fibre Channel, and PCI Express
- 1-2 switches per rack for storage and networking
- Each CPU/host carries bridging devices (LOM/NIC, HBA)
- PCI Express is already available in every box
ExpressFabric Rack Scale Integration
- Shared I/O drops costs & power; pushes the network to top of rack, removing costly NICs/HBAs
- Converged PCI Express fabric through an ExpressFabric switch
- High-performance 32Gb/s links (x4), scalable to x8, x16
- Simple PCIe retimers: ~$5 and ~1 Watt
- Summary: ~1/2 the cost, ~1/2 the power
ExpressFabric Rack Scale Integration (continued)
- A PCIe switch box serving the rack; can be in any location
- Enables shared SSD (Flash) storage
- GPGPU computing: PCIe offers the highest performance and is the standard interface for GPUs
- Microservers: PCIe as a future fabric
Summary
- PCIe offers the lowest latency and bandwidth scalability
- PCIe has been adopted in many new applications and is expected to expand into others
- PCI-SIG has added new features to help support current & future uses of PCIe
- PCIe vendors are adding new features to enhance performance & ease of use in data centers
- Very well suited for emerging applications
NVM Express: Unlock Your Solid State Drive's Potential
October 23, 2013
David Akerson, Intel Corporation
Mike James, SanDisk Corporation
NVM Express
- NVM Express (NVMe) is a standardized high-performance host controller interface for PCIe SSDs, designed for this and the next generation of non-volatile memory
- Developed by an open industry consortium of over 90 companies and directed by a 13-company Promoter Group
- Architected from the ground up for non-volatile memory, to address Enterprise and Client system needs
- Reduces latency and provides faster performance, with support for security and end-to-end data protection
Architected for Performance
NVM Express Solid State Drives
- Traditional interfaces (SAS/SATA) were developed for HDDs: up to ~200 IOPS per drive, with interface inefficiencies hidden by the slow media
- NVM Express is architected for SSDs: over 3 million IOPS, with inefficiencies exposed
[Diagram: OS and applications reaching NVMe SSDs directly through the processor/chipset PCIe root complex, versus reaching SAS/SATA SSDs through a host bus adapter, RAID-on-chip, or I/O controller]
NVM Express Usage Models
Server Caching:
- Used for temporary data; non-redundant
- Used to reduce memory footprint
Server Storage:
- Typically for persistent data; commonly used as Tier-0 storage
- Redundant (i.e., RAIDed)
Client Storage:
- Used for a boot/OS drive and/or HDD cache; non-redundant
- Power optimized
External Storage (SAN):
- Used for just metadata or all data
- Multi-ported device; redundancy based on usage
Performance Advantage Case Study
NVM Express reduces latency overhead by more than 50% in the Linux* storage stack:
- SCSI/SAS: 6.0 µs, 19,500 cycles
- NVMe: 2.8 µs, 9,100 cycles
The Chatham NVMe prototype measured 1.02M IOPS.
Measurement taken on an Intel Core i5-2500K 3.3GHz 6MB L3 Cache Quad-Core Desktop Processor using Linux RedHat* EL6.0 2.6.32-71 kernel.
*Other names and brands may be claimed as the property of others.
StorNVMe Delivers a Great Solution
StorNVMe implementation highlights:
- Uses the hardened Enterprise storage stack
- Strives for a 1:1 mapping of queues to processors; NUMA optimized
- Asynchronous notification supported
- Interrupt coalescing supported
- Rigorous testing on Windows*
- Firmware update/download (via IOCTL)
With great IOPS, and low latency.
Source: Intel Developer Forum Session SSDS004, Robin Alexander, Microsoft. System configuration: 2-socket Romley-based server, 16 physical processors (32), random read workload, tool: IOmeter.
Next Generation Scalable NVM
Scalable resistive memory element: wordlines, memory element, selector device; cross-point array in backend layers; ~4λ² cell.
NVM options and their defining switching characteristics:
- Phase Change Memory: energy (heat) converts the material between crystalline (conductive) and amorphous (resistive) phases
- Magnetic Tunnel Junction (MTJ): switching of a magnetic resistive layer by spin-polarized electrons
- Electrochemical Cells (ECM): formation / dissolution of a nano-bridge by electrochemistry
- Binary Oxide Filament Cells (Resistive RAM): reversible filament formation by oxidation-reduction
- Interfacial Switching (Resistive RAM): oxygen-vacancy drift/diffusion induced barrier modulation
Many candidate next-generation NVM technologies offer ~1000x speed-up over NAND, closer to DRAM.
Fully Exploiting Next Generation NVM
With next-generation NVM, the NVM is no longer the bottleneck:
- Need an optimized platform storage interconnect
- Need optimized software storage access methods
Benefits of NVMe Standardization
Standard drivers:
- Eliminates the need for OEMs to qualify a driver for each SSD vendor
- Enables broad adoption across a wide range of industry-standard and proprietary operating systems
Consistent feature set:
- All SSDs implement required baseline features
- Optional features are implemented in a consistent manner
Industry ecosystem:
- Development tools (analyzers, emulators, test platforms)
- IP cores and controllers
- Compliance and interoperability testing
NVM Express Basics
NVMe Structure
The NVM Express specification defines:
- A queuing interface
- The Admin command set
- I/O command sets: the NVM command set, plus three reserved sets
Queuing Interface
Command submission:
1. Host writes command to submission queue
2. Host writes updated submission queue tail pointer to doorbell
Command processing:
3. Controller fetches command
4. Controller processes command
Command completion:
5. Controller writes completion to completion queue
6. Controller generates MSI-X interrupt
7. Host processes completion
8. Host writes updated completion queue head pointer to doorbell
[Diagram: submission and completion queues in host memory, with tail and head doorbell registers on the NVMe controller]
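The eight steps above can be sketched with a toy queue-pair model (illustrative Python; doorbell writes are modeled as plain method calls, and all names are invented for the sketch, not taken from any real driver):

```python
from collections import deque

class NvmeQueuePair:
    """Toy model of one NVMe submission/completion queue pair.
    A real controller reads the queues from host memory over PCIe."""

    def __init__(self, depth: int = 64):
        self.depth = depth
        self.sq = [None] * depth   # submission queue (host writes entries)
        self.cq = deque()          # completion queue (controller writes entries)
        self.sq_tail = 0           # host-owned tail index
        self.sq_head = 0           # controller-owned head index

    def submit(self, command: dict):
        """Steps 1-2: host queues a command, then rings the tail doorbell."""
        self.sq[self.sq_tail] = command
        self.sq_tail = (self.sq_tail + 1) % self.depth  # the doorbell write

    def controller_poll(self) -> int:
        """Steps 3-6: controller fetches, processes, posts completions.
        Returns the number of completions (an MSI-X interrupt would fire)."""
        completions = 0
        while self.sq_head != self.sq_tail:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq.append({"cid": cmd["cid"], "status": 0})  # 0 = success
            completions += 1
        return completions

    def reap(self) -> list:
        """Steps 7-8: host consumes completions, then rings the CQ head doorbell."""
        done = list(self.cq)
        self.cq.clear()
        return done
```

Note the division of ownership: the host only ever advances the tail, the controller only the head, which is what lets the real interface run without locks between producer and consumer.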
Scalable Queuing Interface
- Enables NUMA-optimized drivers: one or more I/O submission queues, an I/O completion queue, and an MSI-X interrupt per core
- High-performance, low-latency command issue with no locking between cores
- Up to ~2^32 outstanding commands: up to ~64K I/O submission and completion queues, each supporting up to ~64K outstanding commands
[Diagram: an Admin submission/completion queue pair for management, plus per-core I/O queue pairs, each with its own MSI-X vector into the NVMe controller]
Efficient Command Interface
- All parameters for a 4KB command fit in a single 64B DMA fetch
- Interrupt coalescing
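To make the 64-byte claim concrete, here is a sketch that packs an NVMe Read (opcode 0x02) into its 64-byte submission queue entry. The field offsets follow the NVMe 1.1 command layout (CDW0 opcode/command ID, DW1 namespace ID, DW6-7 PRP1, CDW10-11 starting LBA, CDW12 block count); the helper name and parameter values are invented for illustration.

```python
import struct

def build_nvme_read(cid: int, nsid: int, prp1: int, slba: int, nlb0: int) -> bytes:
    """Pack a 64B NVMe Read command. nlb0 is 0-based (0 means one block)."""
    cmd = bytearray(64)
    struct.pack_into("<I", cmd, 0, 0x02 | (cid << 16))  # CDW0: opcode 0x02 + command ID
    struct.pack_into("<I", cmd, 4, nsid)                # DW1: namespace ID
    struct.pack_into("<Q", cmd, 24, prp1)               # DW6-7: PRP entry 1 (data page)
    # DW8-9 (PRP2) stays 0: a single 4KB page needs only PRP1
    struct.pack_into("<Q", cmd, 40, slba)               # CDW10-11: starting LBA
    struct.pack_into("<I", cmd, 48, nlb0)               # CDW12[15:0]: number of blocks
    return bytes(cmd)

# One 4KB read (eight 512B blocks) described entirely by one 64B entry:
cmd = build_nvme_read(cid=7, nsid=1, prp1=0x1000, slba=2048, nlb0=7)
```

The point of the slide is that this single 64B structure is the whole fetch: opcode, tags, data pointer, and LBA range all arrive in one DMA.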
[Chart: normalized CPU utilization per I/O for SATA, SAS, and NVMe, at a 25K IOPS workload for each device]
NVMe Command Sets
- Admin command set
- I/O command sets: the NVM command set, plus three reserved sets
Simple Optimized Command Set
Admin commands:
- Create I/O Submission Queue
- Delete I/O Submission Queue
- Create I/O Completion Queue
- Delete I/O Completion Queue
- Get Log Page
- Identify
- Abort
- Set Features
- Get Features
- Asynchronous Event Request
- Firmware Activate (optional)
- Firmware Image Download (optional)
NVM admin commands:
- Format NVM (optional)
- Security Send (optional)
- Security Receive (optional)
NVM I/O commands:
- Read
- Write
- Flush
- Write Uncorrectable (optional)
- Compare (optional)
- Dataset Management (optional)
- Write Zeroes (optional)
- Reservation Register (optional)
- Reservation Report (optional)
- Reservation Acquire (optional)
- Reservation Release (optional)
Only 10 Admin commands and 3 I/O commands are required!
Protection Information
- Protection Information (PI) is 8 bytes: Guard (2B), Application Tag (2B), Reference Tag (4B)
- PI has the same fields and format as in T10; compatible with T10 Type 0, 1, 2 and 3 protection
- PI may be located in either the first 8B or the last 8B of the metadata
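A sketch of building that 8-byte PI field for one logical block. The CRC-16 polynomial 0x8BB7 and the big-endian field order come from T10 DIF; the helper names and tag values are invented for the example.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with the T10-DIF polynomial 0x8BB7 (init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def make_pi(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """8B PI: 2B guard CRC over the data block, 2B app tag, 4B ref tag."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)

# PI for one 512B block; for Type 1 the ref tag carries the lower 32 bits of the LBA
pi = make_pi(b"\x00" * 512, app_tag=0x1234, ref_tag=100)
```

On a read, the same computation is repeated over the returned data and compared against the stored guard, which is how end-to-end corruption is caught.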
Protection Information: Strip and Insert
For both writes and reads, an NVMe SSD supports three modes:
- No protection information: host and NVM exchange LBA data only
- Pass protection information: PI travels with the data between host, controller, and NVM
- Insert/strip: the controller inserts PI on writes (host sends data only; NVM stores data + PI) and strips PI on reads (NVM returns data + PI; host receives data only)
PCIe Memory Metadata Options
- Data and metadata in separate buffers ("DIX mode")
- Data and metadata interleaved in a single buffer ("DIF mode")
Scatter Gather Lists (SGLs)
- A read command specifies a starting LBA and a number of logical blocks; an SGL describes where the data lands in host physical memory
- Each SGL descriptor is an (address, length) pair; the buffers may be scattered across host memory
- Infinite flexibility, at the expense of complex out-of-order data delivery
[Diagram: flash data D0-D7 mapped through SGL descriptors C0-C5 into scattered host memory buffers]
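The descriptor walk can be sketched in a few lines (illustrative only; a real SGL also has segment and last-segment descriptor types, which are omitted here, and the function name is invented):

```python
def sgl_gather(host_memory: bytes, descriptors) -> bytes:
    """Assemble a command's data by walking (address, length) data block descriptors."""
    return b"".join(host_memory[addr:addr + length] for addr, length in descriptors)

# Two buffers scattered, out of order, in a toy host memory image:
memory = b"....WORLD...HELLO ..."
sgl = [(12, 6), (4, 5)]          # descriptors: (address, length)
data = sgl_gather(memory, sgl)   # reassembles b"HELLO WORLD"
```

Because the descriptors can point anywhere in any order, the device may have to deliver data out of order relative to its media layout, which is exactly the complexity the slide trades for flexibility.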
NVM Express Status
NVM Express Products
- First plugfest held May 2013, with 11 companies participating
- Three devices on the Integrators List
- Samsung announced the first NVM Express* (NVMe) product in July
- NVM Express products targeting the datacenter are shipping this year
NVM Express Specifications
- NVM Express (NVMe) 1.0 published March 2011
- NVMe 1.1 published October 2012
  - Enterprise: multi-path I/O and namespace sharing
  - Client: lower power through autonomous power state transitions
Future directions:
- Namespace management; namespace inventory notice event
- Firmware activation without reset
- More power management enhancements: power state performance and transitional energy; runtime power removal
- SGL enhancements
- Metadata optimization
- Enhanced error reporting
Driver Development on Major OSs
- Windows*: Windows 8.1 and Windows Server 2012 R2 include an inbox driver; open-source driver developed in collaboration with OFA
- Linux*: native OS driver since Linux 3.3 (Jan 2012)
- Unix: FreeBSD driver upstream; ready for release
- Solaris*: driver will ship in S12
- VMware*: vmklinux driver certified release in Dec 2013
- UEFI: open-source driver available on SourceForge
Native OS drivers already available, with more coming!
*Other names and brands may be claimed as the property of others.
Datacenter: Hot Plug Support
- Hot Add and Hot Remove are software management events
  - During boot, the system must configure registers and resources for hot plug
  - Existing BIOS and Linux/Windows OS drivers are prepared to support this today
- Surprise Remove requires careful software design
  - Removing a device with I/O outstanding is not recommended in SAS, nor in PCIe
  - Requires driver hardening to guard against consuming synthesized values as valid data on error conditions
  - The platform or kernel needs to filter errors from links after a surprise link-down event
- Complex devices require more preallocated resources
  - PCI Segment support to overcome bus limitations
Normal hot plug cases are being addressed.
Datacenter Challenges
- Connector topologies
- Cabling
- Thermals
- Enclosure management
Ecosystem solutions address these challenges; see the IDF SSDS003 presentation from Intel's James Myers for details.
Bringing NVM Express to Market
Ecosystem supporting a complete solution.
*Other names and brands may be claimed as the property of others.
NVM Express Summary
- Development core philosophy: simplicity and efficiency
- Architected for performance
- Scalable from Client to Enterprise
- Standardized, consistent feature set
- Supports the current and next generation of NVM
- NVM Express is mature: specifications, drivers, conformance and interoperability testing
- Products beginning 2H'13
Unlock your Solid-State Drive's potential. Visit nvmexpress.org for more information.
Thank you! Questions?