How NVMe and 3D XPoint Will Create a New Datacenter Architecture

How NVMe and 3D XPoint Will Create a New Datacenter Architecture
Emilio Billi, CTO, A3Cube Inc.
Santa Clara, CA

The Storage Paradigm Shift
To understand the future of storage-memory devices we need to consider two things:
First, modern applications such as data analytics, machine learning training, and data mining use storage as actively as past applications used computation.
Second, we are moving from HPC to HPD in all everyday operations.

We need to start with a little bit of theory
Storage architecture: Symmetric Storage Access
Diagram: servers connected through a storage fabric (e.g. NVMe over Fabrics, FC, FCoE), any topology, to a storage array.
- Old type of storage: a shared architecture (the FC concept with modern variants)
- Difficult to optimize and tune at the application level
- No proximity algorithm
- Generic type of storage (no application-specific optimization)

We need to start with a little bit of theory (cont.)
Diagram: a Fibre Channel stack (servers, FC controllers, storage disk array) compared with an NVMe over Fabrics stack (servers, Ethernet RDMA NICs, RDMA, NVMe disks).

We need to start with a little bit of theory (cont.)
NUFA architecture: the new concept designed for modern applications and devices
Diagram: storage devices on each node, connected by a storage fabric (RMA over PCIe, RDMA over RoCE/InfiniBand), any topology.
- Ultra-low-latency approach for NVMe, ready for 3D XPoint memory
- Ready for massively parallel data access optimization (e.g. Spark, Hadoop, ML)
- A radically new approach that fits modern application data patterns
- Extremely optimizable
- Cost effective (lower CapEx and OpEx)

We need to start with a little bit of theory (cont.)
Symmetric Data Access vs NUFA architecture
Symmetric data access:
- Uniform access latency (no proximity)
- Devices cannot be used as extra caching
Non-uniform data access latency (NUFA):
- Lowest-latency optimization
- Proximity and multiple collaborative caches are possible
- NVMe and 3D XPoint can be used as a memory extension (e.g. SSDAlloc locally, with no latency penalty)
- Hybrid implementations are possible for easy capacity scaling
As a simple illustration of the proximity idea, see the placement sketch below.
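
The following is a minimal sketch of the NUFA idea, not A3Cube's implementation: a reader prefers the replica closest to it (a local NVMe device first, otherwise the fewest fabric hops). Node names, device paths, and the latency figures in the table are illustrative assumptions.

    # A minimal sketch (not A3Cube's implementation) of the NUFA idea:
    # prefer the device that is "closest" to the requesting node and fall
    # back to a remote replica over the fabric only when necessary.
    # Node names, device paths, and latency figures are illustrative.

    REPLICAS = {
        "block-42": [
            {"node": "node-a", "path": "/dev/nvme0n1", "hop_latency_us": 0.0},   # local NVMe
            {"node": "node-b", "path": "/dev/nvme0n1", "hop_latency_us": 3.4},   # one RDMA hop
            {"node": "node-c", "path": "/dev/nvme1n1", "hop_latency_us": 6.8},   # two hops
        ]
    }

    def pick_replica(block_id: str, local_node: str) -> dict:
        """NUFA placement: choose the replica with the lowest estimated access cost."""
        candidates = REPLICAS[block_id]
        # Proximity rule: a local replica always wins; otherwise minimize fabric latency.
        local = [r for r in candidates if r["node"] == local_node]
        return local[0] if local else min(candidates, key=lambda r: r["hop_latency_us"])

    if __name__ == "__main__":
        choice = pick_replica("block-42", local_node="node-a")
        print("read block-42 from", choice["node"], choice["path"])

A symmetric design would route every read through the same shared path regardless of where the data physically sits; the proximity rule above is what the NUFA model adds.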

The Importance of Latency vs Bandwidth
What are bandwidth and latency, and where is the problem?
Bandwidth is easier to understand than latency: 100 Gbit/s sounds faster than a 1 us network.
- Cold storage relies on bandwidth; hot storage relies on latency.
- High bandwidth is like moving hundreds of people in a single trip: efficient over long distances.
- Low latency is like moving one person extremely fast: designed to win over short distances.
Modern applications (e.g. analytics, fast databases, machine learning) are latency driven, not bandwidth driven. A small worked example follows.
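
As a back-of-the-envelope illustration of why small, hot I/O is latency bound, the snippet below adds an assumed 10 us of per-operation real latency to the serialization time of a 100 Gbit/s link; both numbers are assumptions taken from the figures quoted later in this talk.

    # Back-of-the-envelope numbers (illustrative, not measured): for small I/Os
    # the fixed per-operation latency dominates; for large I/Os bandwidth dominates.

    LINK_GBIT_S = 100          # nominal link bandwidth
    REAL_LATENCY_US = 10       # assumed end-to-end software latency per operation

    def transfer_time_us(size_bytes: int) -> float:
        serialization_us = size_bytes * 8 / (LINK_GBIT_S * 1e3)  # bits / (bits per us)
        return REAL_LATENCY_US + serialization_us

    for size in (4 * 1024, 1 * 1024 * 1024, 100 * 1024 * 1024):
        t = transfer_time_us(size)
        print(f"{size:>12} bytes: {t:10.1f} us "
              f"(latency share {100 * REAL_LATENCY_US / t:.0f}%)")

With these assumed numbers a 4 KB read spends about 97% of its time in latency, while a 100 MB transfer is almost entirely bandwidth bound.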

The Importance of Latency vs Bandwidth (cont.)
What we can say about latency:
Latency is the most critical performance factor because it directly affects system data exchange time. Latency means lost time: time that could have been spent productively producing computational results is instead spent waiting for I/O resources to become available. Latency is the application's stealth tax, silently extending the elapsed times of individual computational tasks and processes, which then take longer to execute. Once all such delays are accounted for, overall system performance can drop significantly. Latency matters not just at the storage device level, but across the whole system: through the system itself, the inter-node fabric, and the software applications.

The Importance of Latency vs Bandwidth (cont.)
Hardware latency vs real latency
To evaluate application performance we need to pay attention to the REAL latency (hardware plus drivers, kernel, and APIs), because that is the latency that actually impacts the application.
Example with Ethernet:
- Ethernet hardware latency: 100-150 ns (MAC, average for modern NICs)
- Ethernet real latency: ~10 us (standard TCP/IP, zero-byte message)
A minimal way to measure real latency yourself is sketched below.
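
A simple way to see real latency, as opposed to the quoted hardware latency, is a TCP ping-pong microbenchmark. The sketch below is a generic example (host, port, and round count are arbitrary choices), not the measurement behind the 10 us figure above.

    # A minimal sketch of measuring "real latency" (application to application):
    # a one-byte TCP ping-pong. Run with --server on one end, plain on the other.

    import argparse, socket, time

    HOST, PORT, ROUNDS = "127.0.0.1", 5001, 10000

    def server():
        with socket.create_server((HOST, PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while conn.recv(1):          # echo one byte back per round trip
                    conn.sendall(b"x")

    def client():
        with socket.create_connection((HOST, PORT)) as s:
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            start = time.perf_counter()
            for _ in range(ROUNDS):
                s.sendall(b"x")
                s.recv(1)
            rtt_us = (time.perf_counter() - start) / ROUNDS * 1e6
            print(f"average round trip: {rtt_us:.1f} us, one-way ~{rtt_us / 2:.1f} us")

    if __name__ == "__main__":
        p = argparse.ArgumentParser()
        p.add_argument("--server", action="store_true")
        server() if p.parse_args().server else client()

The measured number includes sockets, TCP/IP, the kernel, and the NIC, which is exactly the "real latency" the slide is referring to.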

The Importance of Latency vs Bandwidth (cont.)
Why latency is important with NVMe and will be critical with 3D XPoint
Considerations about real latency: at each hop (e.g. network switches) the situation gets worse and worse, while NVMe device latency sits between 10 us and 100 us.

The Importance of Latency vs Bandwidth (cont.)
Why latency is important with NVMe and will be critical with 3D XPoint
With NVMe (10 us to 100 us device latency), RDMA can solve the problem (e.g. NVMe over Fabrics): Mellanox RDMA real latency is about 3.44 us. But what about 3D XPoint?

3D XPoint in a nutshell: 1000X faster than NAND, 1000X denser than conventional memory.

The Importance of Latency vs Bandwidth (cont.)
Why latency is important with NVMe and will be critical with 3D XPoint
Some key points to consider:
- 3D XPoint average read/write latency: 100-500 ns
- 3D XPoint is byte addressable
- 3D XPoint can act as a memory device and be used by applications as memory
- It is not just a faster storage device
A sketch of what byte-addressable use looks like from an application follows.
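
To make the byte-addressability point concrete, the sketch below memory-maps a file and touches individual bytes with load/store semantics instead of block I/O. The /mnt/pmem0 mount point is an assumption standing in for a DAX-capable persistent-memory filesystem; real persistent-memory code needs additional care about flushing and ordering.

    # A minimal sketch of byte-addressable use from an application: map a file
    # on an (assumed) DAX-mounted persistent-memory filesystem and access single
    # bytes directly. Any ordinary file works for experimenting with the API.

    import mmap, os

    PATH = "/mnt/pmem0/example.bin"   # assumed DAX mount point
    SIZE = 4096

    fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(fd, SIZE)

    with mmap.mmap(fd, SIZE) as view:
        view[0:5] = b"hello"          # byte-granular store, no block read-modify-write
        print(bytes(view[0:5]))       # byte-granular load
        view.flush()                  # on real pmem, persistence needs extra care
    os.close(fd)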

The Importance of Latency vs Bandwidth (cont.)
Why latency is important with NVMe and will be critical with 3D XPoint
With 3D XPoint (100 ns to 500 ns device latency), that latency is no longer good enough: Mellanox RDMA is about 3.44 us, and remember that we are talking about real latency, not hardware latency.

3D XPoint Scale-Out Challenges
- Maintaining ultra-low latency to data across local and remote devices
- Using proximity to take maximum advantage of byte addressability (system memory extension)
- Combining sophisticated technologies in a new way:
  - Direct addressability (an extension of malloc)
  - An Open-Channel architecture managed with distributed-kernel or replicated multi-kernel approaches to achieve cluster-wide optimization

3D XPoint Scale-Out Solutions
Efficient memory clustering: 3D XPoint scale-out with global memory pools.
Example and numbers:
- Advanced use of PCIe direct memory injection at cluster level
- Fully working products on the market (e.g. RONNIEE Express, Dolphin IXH)
- The CPU acts as the NIC for direct remote memory injection
- Real latency measured application to application
A rough sketch of what a PCIe-mapped memory window looks like from user space follows.
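
As a rough idea of what memory mapped over PCIe can look like from user space, the sketch below mmaps a PCIe BAR exposed through Linux sysfs and accesses it with plain loads and stores. The PCI address is hypothetical, root is required, and this is a generic mechanism for illustration, not the RONNIEE Express or Dolphin API; those products ship their own drivers and memory-registration interfaces.

    # A generic sketch of accessing a PCIe-exposed memory window from user space:
    # mmap a memory BAR of a PCIe device through sysfs and use plain loads/stores.
    # The device address 0000:03:00.0 is hypothetical; root privileges are needed.

    import mmap, os

    BAR = "/sys/bus/pci/devices/0000:03:00.0/resource0"   # hypothetical adapter BAR
    WINDOW = 4096

    fd = os.open(BAR, os.O_RDWR | os.O_SYNC)
    with mmap.mmap(fd, WINDOW, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE) as win:
        win[0:8] = (0xDEADBEEF).to_bytes(8, "little")      # store into the mapped window
        print(hex(int.from_bytes(win[0:8], "little")))     # load back from the window
    os.close(fd)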

3D XPoint Scale-Out Solutions at the Architectural Level
- Direct CPU-to-CPU communication
- Proximity for latency optimization
- Very efficient for in-memory databases and live paging extension of system memory
Diagram: 3D XPoint devices on each node, connected by a storage fabric (RMA over PCIe, RDMA over RoCE/InfiniBand), any topology.
From the concept of NUMA (Non-Uniform Memory Access) to the concept of NUFA (Non-Uniform File system Access).

NVMe and 3D XPoint Advanced Features
Other advanced features built into the NVMe interface enable new kinds of efficient architectures at the datacenter level. NVMe and 3D XPoint, in combination with PCIe cluster-wide global memory mapping or with RDMA, open up new datacenter scenarios: cluster-wide SR-IOV (Single Root I/O Virtualization).

Introducing Cluster-wide SR-IOV
How PCIe SR-IOV is used today (the well-known implementation):
A) A single server runs multiple virtual machines: VM(1) through VM(5)
B) A hypervisor sits between the VMs and the hardware
C) An SR-IOV card in the host server exposes virtual NICs (PCIe Virtual Functions, VFs), one per VM
A brief sketch of how VFs are created on Linux today follows.
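
For reference, this is how the conventional single-host SR-IOV flow looks on Linux today: Virtual Functions are created by writing to the device's sriov_numvfs sysfs attribute. The PCI address below is an assumption, and the operation requires root and an SR-IOV-capable device and driver.

    # A minimal sketch of the conventional, single-host SR-IOV flow on Linux:
    # create Virtual Functions on an SR-IOV-capable device via sysfs.

    from pathlib import Path

    DEV = Path("/sys/bus/pci/devices/0000:81:00.0")    # hypothetical SR-IOV NIC

    total = int((DEV / "sriov_totalvfs").read_text())  # how many VFs the card supports
    print(f"device supports up to {total} VFs")

    (DEV / "sriov_numvfs").write_text("4")             # create 4 VFs (writing 0 disables them)

    # Each VF now appears as its own PCI function, ready to be passed to a VM.
    vfs = sorted(p.name for p in DEV.glob("virtfn*"))
    print("virtual functions:", vfs)

Cluster-wide SR-IOV, described next, extends this idea so that the VFs can be consumed by other physical servers rather than only by VMs on the host.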

Introducing Cluster-wide SR-IOV (cont.)
Distributed PCIe discovery: the operating system and virtual machines can remotely discover, address, access, and use standard PCIe devices.
Diagram: two physical machines (no hypervisor) connected by RONNIEE Express with dynamic remote enumeration; a standard server with a physical NVMe device exports VF1 to a standard server that sees it as a virtual NVMe device.

Introducing Cluster-wide SR-IOV (cont.)
Diagram: one physical server with a physical NVMe device exports VF1, VF2 ... VFn as virtual NVMe devices to other physical servers.
1) Transparent to the operating system
2) The virtual device uses the original OS driver (no modification)
3) The NVMe device(s) are seen as local by all the servers (unsupervised sharing: no software control involved, native direct disk access)
Utilization is made easy by using an unmodified shared-disk file system such as OCFS2 (Oracle), GFS (Red Hat), or GPFS (IBM), or by using multiple partitions inside the disks.

Introducing Cluster-wide SR-IOV (cont.)
From unsupervised sharing to distributed supervised sharing.
Diagram: the same physical servers and virtual NVMe devices (VF1, VF2 ... VFn), now coordinated by a Parallel IO Manager layer that presents a global name space.

The Key Points of the Parallel IO Manager Layer
Distributed supervised sharing: physical servers connected over PCIe, with a Parallel IO Manager layer providing a global name space.
A new way to think about NVMe sharing:
1) Symmetrically sharing devices across different nodes
2) Managing cooperative caching and memory proximity
3) Accessing distributed data in a fully parallel way, from any node and from any application
A small illustrative sketch of point 3 follows.
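
A toy sketch of point 3: a global name space maps a logical path to stripes held on different nodes, and a client reads all stripes in parallel. The namespace table, node names, and the fetch_stripe() helper are hypothetical, for illustration only; they are not the Fortissimo Foundation API.

    # A hypothetical sketch of parallel access through a global name space:
    # a logical object resolves to stripes on several nodes, read concurrently.

    from concurrent.futures import ThreadPoolExecutor

    GLOBAL_NAMESPACE = {
        "/data/trainset.bin": [           # logical path -> ordered list of stripes
            ("node-a", "/dev/nvme0n1", 0),
            ("node-b", "/dev/nvme0n1", 1),
            ("node-c", "/dev/nvme1n1", 2),
        ]
    }

    def fetch_stripe(node: str, device: str, index: int) -> bytes:
        # Placeholder: a real system would issue a local read or an RDMA/PCIe
        # read against the remote virtual NVMe device.
        return f"<stripe {index} from {node}:{device}>".encode()

    def parallel_read(path: str) -> bytes:
        stripes = GLOBAL_NAMESPACE[path]
        with ThreadPoolExecutor(max_workers=len(stripes)) as pool:
            parts = pool.map(lambda s: fetch_stripe(*s), stripes)
        return b"".join(parts)

    if __name__ == "__main__":
        print(parallel_read("/data/trainset.bin").decode())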

How architecture impacts application performance

A Benchmark Application: Hadoop
Comparison between:
A) A standard implementation with NVMe as storage and InfiniBand FDR (IPoIB) at 56 Gbit/s
B) A NUFA architecture (*) on the same hardware as A, with distributed data management, pervasive SR-IOV, and direct NVMe data access (with proximity)
Simple test: a 40 GB TeraGen run (sketched below).
(*) Fortissimo Foundation is a commercial product, the first to introduce all the technology described above (including direct memory injection) on standard servers.
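
For context, the sketch below shows what a 40 GB TeraGen run looks like: TeraGen writes 100-byte rows, so 40 GB corresponds to roughly 400 million rows. The examples jar path and the HDFS output directory are assumptions that vary between Hadoop installations.

    # A sketch of the test workload itself (not the exact benchmark setup):
    # generate ~40 GB of synthetic TeraGen data via the Hadoop examples jar.

    import subprocess

    ROWS = 400_000_000          # 400M rows x 100 bytes/row ~= 40 GB
    EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"

    subprocess.run(
        ["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS), "/benchmarks/teragen-40g"],
        check=True,
    )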

A Benchmark
Execution time for the 40 GB TeraGen run:
A) Symmetric storage architecture on NVMe: 11.53 min
B) NUFA storage architecture on NVMe: 51 sec
Same hardware, same storage, same NVMe devices; only the architectural organization differs (hardware configuration available on request). That is roughly a 13x reduction in execution time.

Machine Learning Data Path and Architecture

Machine Learning Training Data Challenges
- Massive datasets
- Massively parallel computation and data operations
- Thousands of iterations
- Intense use of MapReduce
- Hybrid computing (the possibility of direct accelerator-to-storage access)

Machine Learning Training Data Challenges (cont.)

Coming Back to the Architecture
High bandwidth is useful when you need to carry a large amount of data from A to B. Analytics and ML, however, need to exchange data among thousands of compute cores and storage devices distributed across different nodes, over many different paths (not just A to B). In such massively parallel scenarios, with multiple time-critical data paths, high bandwidth alone is not enough unless it is combined with low latency. NVMe and 3D XPoint are perfectly able to support this, but only when the total architecture is considered.

Coming Back to the Architecture (cont.)
The faster the devices, the better the design required in the overall system architecture.

Thank you. Questions?