RDMA GPU Direct for iomemory GPU Technology Conference: S4265



Robert Wipfel, David Atkisson, Vince Brisebois. RDMA GPU Direct for iomemory. GPU Technology Conference: S4265. Copyright 2014 Fusion-io, Inc. All rights reserved.

Abstract: Learn how to eliminate I/O bottlenecks by integrating iomemory flash storage into your GPU apps. Technical overview of Fusion-io's PCIe-attached iomemory. Best practices and tuning for GPU applications using iomemory-based flash storage. Topics will cover threading, pipelining, and data path acceleration via RDMA GPU Direct. Sample code showing integration between RDMA GPU Direct and iomemory will be presented. 2

Outline: iomemory, CUDA, programming patterns, GPU Direct, RDMA GPU Direct for iomemory, API, sample code, performance, conclusion. 3

iomemory

Introduction: Fusion-io delivers the world's data. Faster. From workstations to datacenter servers, Fusion-io accelerates innovative industry leaders and over 4000 other companies worldwide.

iomemory Accelerates Applications: relational databases, server virtualization, VDI, big data, search, analytics, INFORMIX, HPC, messaging, workstation, collaboration, caching, security/logging, web, MQ, LAMP, GPFS, Lotus. 6

Flash Results Are Dramatic (financials, web, tech, retail, gov/mfg; customers include the world's leading Q&A site): 5x faster data analysis, 30x faster database replication, 40x faster data queries, 15x data warehouse query processing throughput, 15x faster queries. 60+ case studies at http://fusionio.com/casestudies 7

iomemory Purpose-Built Products: Enterprise (scale up), Hyperscale (scale out), Workstation (single user). 8

Product capacities: up to 3.0TB per PCIe slot; mezzanine form factor with up to 1.2TB per slot for maximum performance density; up to 2.4TB per x8 PCIe slot; up to 3.2TB per PCIe slot; and up to 1.6TB of workstation acceleration. 9

Access Delay Time: CPU cache and DRAM operate at nanoseconds (10^-9 s), server/workstation-based flash at microseconds (10^-6 s), and disk drives & SSDs at milliseconds (10^-3 s). Server-based flash is significantly faster than disk, more scalable than memory, less expensive than memory, and data is persistent like on disk.

Cut-Through Architecture: in the generic SSD/disk approach, I/O from the host CPU (app/OS) passes through a RAID controller, SAS, and an SSD controller with DRAM and supercapacitors before reaching NAND; in the Fusion cut-through approach, the host CPU (app/OS) connects over PCIe directly to the iomemory data-path controller and NAND. 11

High Bandwidth Scalability: iomemory bandwidth scales linearly when adding multiple devices, while SSDs and PCIe SSDs max out at around 2GB/s. [Chart: read bandwidth (MB/s) versus number of devices (1-6), iofx read vs. SSD read, showing the iomemory sweet spot.] Example: eight iomemory devices in RAID0 provide a bandwidth of 12GB/s in a single system!

Latency: with an average response time of 50µs, iomemory can service roughly 20 I/Os for every 1ms of disk access latency. [Chart: a 50µs iomemory response plotted against a disk access timeline running from 50µs to 1000µs.]

Consistent Application Performance: iomemory balances read/write performance for consistent throughput; with SSDs, queuing behind slow writes causes latency spikes. 14

iodrive2, iomemory for Enterprise: pushing the performance density envelope with the iomemory platform. Combines VSL and iomemory into a platform that takes enterprise applications and databases to the next level: consistent low-latency performance; industry-leading capacity; lowest CPU utilization per unit of application work; unrivaled endurance; enterprise reliability via self-healing, wear management, and predictive monitoring; extensive OS support and flexibility. 15

iodrive2 Duo, for the Heaviest Workloads: based on the same iomemory platform as the iodrive2, the iodrive2 Duo provides 2.4TB of iomemory per x4 and x8 PCI Express slot. Organizations can handle twice the transactions, queries and requests; add capacity for future growth; ensure applications seamlessly handle larger traffic spikes; and consolidate more infrastructure. 16

Drivers for Hyperscale Data Centers. Market segments: large-scale transaction processing; content distribution and on-demand streaming; big data management and metadata indexing; cloud computing and multi-tenancy; business intelligence and data warehousing. Key drivers: increase transaction throughput; lower latency and command completion times; lower flash TCO. Source: HGST, Flash Memory Summit, Aug 2012. 17

ioscale, iomemory for Hyperscale: purpose-built for the largest web-scale and cloud environments. Consistent low latency and high performance; reliability through simplicity; unrivaled endurance; tools for flash-smart applications; industry-leading performance density. Capacities: 410GB, 825GB, 1650GB and 3.2TB. 18

iofx, iomemory for Workstations: bridging the gap between creative potential and application performance. Tuned for sustained performance in multithreaded applications: work on 2K, 4K and 5K content interactively, in full resolution; manipulate stereoscopic content in real time; accelerate video and image editing and compositing; speed video playback. Powerful throughput to maximize GPU processing; simplify and accelerate encoding and transcoding, as well as other data-intensive activities in contemporary digital production. 19

iodrive2 Specifications
Capacity: 400GB SLC / 600GB SLC / 365GB MLC / 785GB MLC* / 1.2TB MLC* / 3.0TB MLC
Read Bandwidth (1MB): 1.4 GB/s / 1.5 GB/s / 910 MB/s / 1.5 GB/s / 1.5 GB/s / 1.5 GB/s
Write Bandwidth (1MB): 1.3 GB/s / 1.3 GB/s / 590 MB/s / 1.1 GB/s / 1.3 GB/s / 1.3 GB/s
Random Read IOPS (512B): 360,000 / 365,000 / 137,000 / 270,000 / 275,000 / 143,000
Random Write IOPS (512B): 800,000 / 800,000 / 535,000 / 800,000 / 800,000 / 535,000
Random Read IOPS (4K): 270,000 / 290,000 / 110,000 / 215,000 / 245,000 / 136,000
Random Write IOPS (4K): 270,000 / 270,000 / 140,000 / 230,000 / 250,000 / 242,000
Read Access Latency: 47µs / 47µs / 68µs / 68µs / 68µs / 68µs
Write Access Latency: 15µs (all capacities)
Bus Interface: PCI Express 2.0
Weight: 6.6 ounces / 9.5 ounces
Form Factor: HH x HL / FH x HL
Warranty: 5 years or maximum endurance used
Supported Operating Systems: Microsoft Windows 64-bit (Windows Server 2012, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003); Linux (RHEL 5/6, SLES 10/11, OEL 5/6, CentOS 5/6, Debian Squeeze, Fedora 16/17, openSUSE 12, Ubuntu 10/11/12); UNIX (Solaris 10/11 x64, OpenSolaris 2009.06 x64); OSX 10.6/10.7/10.8; Hypervisors (VMware ESX 4.0/4.1, ESXi 4.1/5.0/5.1, Windows 2008 R2 with Hyper-V, Hyper-V Server 2008 R2)
www.fusionio.com August 2013 20

iodrive2 Duo Specifications
Capacity: 2.4TB MLC* / 1.2TB SLC*
Read Bandwidth (1MB): 3.0 GB/s / 3.0 GB/s
Write Bandwidth (1MB): 2.5 GB/s / 2.5 GB/s
Random Read IOPS (512B): 540,000 / 700,000
Random Write IOPS (512B): 1,100,000 / 1,100,000
Random Read IOPS (4K): 480,000 / 580,000
Random Write IOPS (4K): 490,000 / 535,000
Read Access Latency: 68µs / 47µs
Write Access Latency: 15µs / 15µs
Bus Interface: PCI Express 2.0, x8 electrical, x8 physical
Weight: less than 11 ounces
Form Factor: full-height, half-length
Warranty: 5 years or maximum endurance used
Supported Operating Systems: Microsoft Windows 64-bit (Windows Server 2012, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003); Linux (RHEL 5/6, SLES 10/11, OEL 5/6, CentOS 5/6, Debian Squeeze, Fedora 16/17, openSUSE 12, Ubuntu 10/11/12); UNIX (Solaris 10/11 x64, OpenSolaris 2009.06 x64); OSX 10.6/10.7/10.8; Hypervisors (VMware ESX 4.0/4.1, ESXi 4.1/5.0/5.1, Windows 2008 R2 with Hyper-V, Hyper-V Server 2008 R2)
www.fusionio.com August 2013 21

ioscale Specifications
Capacity: 410GB / 825GB / 1650GB / 3.2TB
Form Factor: HHxHL / HHxHL / HHxHL / FHxHL
Read Bandwidth (1MB): 1.4 GB/s / 1.4 GB/s / 1.4 GB/s / 1.5 GB/s
Write Bandwidth (1MB): 700 MB/s / 1.1 GB/s / 1.1 GB/s / 1.3 GB/s
Random Read IOPS (512B): 78K / 135K / 144K / 115K
Random Write IOPS (512B): 535K / 535K / 535K / 535K
Read Access Latency: 77µs / 77µs / 77µs / 92µs
Write Access Latency: 19µs / 19µs / 19µs / 19µs
Power Profile: < 25W (all capacities)
PBW (typical): 2 / 4 / 8 / 20
Supported Operating Systems: Microsoft Windows 64-bit (Windows Server 2012, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003); Linux (RHEL 5/6, SLES 10/11, OEL 5/6, CentOS 5/6, Debian Squeeze, Fedora 16/17, openSUSE 12, Ubuntu 10/11/12); UNIX (Solaris 10/11 x64, OpenSolaris 2009.06 x64); OSX 10.6/10.7/10.8; Hypervisors (VMware ESX 4.0/4.1, ESXi 4.1/5.0/5.1, Windows 2008 R2 with Hyper-V, Hyper-V Server 2008 R2) 22

iofx Specifications
Capacity: 420GB / 1650GB
NAND Type: MLC (Multi Level Cell)
Read Bandwidth (1MB): 1.4 GB/s / 1.4 GB/s
Write Bandwidth (1MB): 700 MB/s / 1.1 GB/s
Read Access Latency (4K): 77µs / 77µs
Write Access Latency (4K): 19µs / 19µs
Bus Interface: PCI Express 2.0 x4 (x8 physical) / PCI Express 2.0 x4 (x4 physical)
Bundled Mgmt Software: iosphere
Operating Systems: Microsoft Windows 64-bit (Windows Server 2012, Windows Server 2008 R2, Windows 8, Windows 7); Linux (RHEL 5/6, SLES 10/11, CentOS 5/6, Debian Squeeze, Fedora 16/17, Ubuntu 10/11/12, OEL v5/6, openSUSE 12); OSX (Mac OSX 10.6.7/10.7/10.8); Hypervisors (VMware ESX 4.0/4.1, VMware ESXi 4.1/5.0/5.1, Windows 2008 R2 with Hyper-V, Hyper-V Server 2008 R2)
Warranty: 3 years or maximum endurance used
www.fusionio.com August 2013 23

Datasheets http://www.fusionio.com/ 24

CUDA

CUDA and the iomemory VSL: [Diagram: applications/databases running on the CPU cores issue commands through the file system and the kernel Virtual Storage Layer (VSL); the VSL keeps iomemory virtualization tables in host DRAM alongside operating system and application memory, and moves data over PCIe to the iomemory data-path controller, which drives the iodrive's NAND banks across wide channels.] 26

iomemory -> GPU
OS-pinned CUDA buffer:
// Alloc OS-pinned memory
cudaHostAlloc((void**)&h_odata, memsize, (wc) ? cudaHostAllocWriteCombined : 0);
Read from iomemory:
fd = open("/mnt/iomemoryfile", O_RDWR | O_DIRECT);
rc = read(fd, h_odata, memsize);
Copy (DMA) to GPU:
cudaMemcpyAsync(d_idata, h_odata, memsize, cudaMemcpyHostToDevice, stream); 27
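Pulling the three snippets above together, a minimal sketch of the staged read path might look like the following. The file path, sizes, and error handling are illustrative assumptions; O_DIRECT requires sector-aligned transfer sizes, and the page-locked buffer returned by cudaHostAlloc is assumed to satisfy the alignment requirement.

/* Hedged sketch: read a file on iomemory into a pinned staging buffer,
 * then DMA it into GPU memory. Not the vendor's reference code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>

int load_file_to_gpu(const char *path, void *d_dst, size_t memsize,
                     cudaStream_t stream)
{
    void *h_staging = NULL;

    /* OS-pinned (page-locked) staging buffer so the copy can be async DMA */
    if (cudaHostAlloc(&h_staging, memsize, cudaHostAllocDefault) != cudaSuccess)
        return -1;

    /* O_DIRECT bypasses the page cache; the VSL services the I/O directly */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { cudaFreeHost(h_staging); return -1; }

    ssize_t rc = read(fd, h_staging, memsize);
    close(fd);
    if (rc != (ssize_t)memsize) { cudaFreeHost(h_staging); return -1; }

    /* DMA from pinned host memory into GPU memory */
    if (cudaMemcpyAsync(d_dst, h_staging, memsize,
                        cudaMemcpyHostToDevice, stream) != cudaSuccess) {
        cudaFreeHost(h_staging);
        return -1;
    }
    cudaStreamSynchronize(stream);   /* wait before releasing the staging buffer */
    cudaFreeHost(h_staging);
    return 0;
}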

iomemory -> GPU: [Diagram: the application issues the read through the file system and the kernel Virtual Storage Layer, which DMAs data from iomemory into pinned host memory; the CUDA runtime / driver API and Nvidia.ko then DMA it from host memory into the GPU, for two DMAs in total.] 28

GPU -> iomemory
OS-pinned CUDA buffer:
// Alloc OS-pinned memory
cudaHostAlloc((void**)&h_odata, memsize, (wc) ? cudaHostAllocWriteCombined : 0);
Copy (DMA) from GPU:
cudaMemcpyAsync(h_odata, d_idata, memsize, cudaMemcpyDeviceToHost, stream);
Write to iomemory:
fd = open("/mnt/iomemoryfile", O_RDWR | O_DIRECT);
rc = write(fd, h_odata, memsize); 29

Programming Patterns: pipelines, CPU threads, CUDA streams, ring buffers, parallel DMA, direct I/O (a combined sketch follows below). 30
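As a hedged illustration of how the pipeline, stream, ring-buffer, and direct-I/O patterns combine: names such as NBUF, CHUNK, and stream_file_through_gpu are invented for this sketch, error handling is omitted, and the file size is assumed to be a multiple of the chunk size. Chunked O_DIRECT reads into a small ring of pinned buffers are overlapped with host-to-device copies on per-slot CUDA streams; issuing the reads from multiple CPU threads would add the parallel-DMA dimension.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>

#define NBUF  4
#define CHUNK (4 << 20)   /* 4 MiB per chunk, a multiple of 4 KiB for O_DIRECT */

void stream_file_through_gpu(const char *path, void *d_buf[NBUF])
{
    void *h_buf[NBUF];
    cudaStream_t st[NBUF];

    for (int i = 0; i < NBUF; i++) {
        cudaHostAlloc(&h_buf[i], CHUNK, cudaHostAllocDefault);
        cudaStreamCreate(&st[i]);
    }

    int fd = open(path, O_RDONLY | O_DIRECT);
    for (int i = 0; ; i++) {
        int b = i % NBUF;
        /* wait until the previous copy out of this ring slot has finished */
        cudaStreamSynchronize(st[b]);
        ssize_t n = read(fd, h_buf[b], CHUNK);
        if (n <= 0)
            break;
        cudaMemcpyAsync(d_buf[b], h_buf[b], (size_t)n,
                        cudaMemcpyHostToDevice, st[b]);
        /* a kernel consuming chunk b could be launched on st[b] here */
    }
    close(fd);

    for (int i = 0; i < NBUF; i++) {
        cudaStreamSynchronize(st[i]);
        cudaStreamDestroy(st[i]);
        cudaFreeHost(h_buf[i]);
    }
}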

GPU Direct

GPU Direct (https://developer.nvidia.com/gpudirect): copy data directly to/from CUDA pinned host memory; this avoids one copy but is not a host memory bypass, and is currently supported by iodrive (as described previously). Peer-to-peer transfers: P2P DMA between PCIe GPUs; peer-to-peer memory access between GPUs (NUMA-style access between CUDA kernels). 32

RDMA GPU Direct http://docs.nvidia.com/cuda/gpudirect-rdma/index.html 33

RDMA GPU Direct: available in Kepler-class GPUs and CUDA 5.0+. Fastest possible inter-device communication; utilizes PCIe peer-to-peer DMA transfer; avoids host memory altogether, with no bounce buffer and no copy-in/copy-out; a third-party device can directly access GPU memory. Tech preview (proof of concept) for iomemory. 34

RDMA GPU Direct for iomemory

iomemory <-> GPU: [Diagram: the application still goes through the file system and the kernel Virtual Storage Layer, but the VSL uses nvidia_get/put_p2p_pages() in Nvidia.ko to pin GPU memory and performs a single peer-to-peer DMA directly between iomemory and the GPU.] 36

Design Considerations: supported systems are Kepler-class GPUs with Intel Sandy/Ivy Bridge; user space tokens are queried from CUDA and passed to the VSL; lazy unpinning; kernel driver linkage, where the VSL uses the nvidia_get/put_pages() APIs and adopts the Mellanox GPU Direct approach. 37

CUDA API
Data structure:
typedef struct CUDA_POINTER_ATTRIBUTE_P2P_TOKENS_st {
    unsigned long long p2pToken;
    unsigned int vaSpaceToken;
} CUDA_POINTER_ATTRIBUTE_P2P_TOKENS;
Function:
CUresult CUDAAPI cuPointerGetAttribute(void *data, CUpointer_attribute attribute, CUdeviceptr pointer);
Usage:
ptr = cudaMalloc();
cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, ptr); 38
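A slightly fuller, hedged sketch of the token query: the helper name, sizes, and error handling are illustrative assumptions. The allocation is made with the runtime API, and the driver API then queries the P2P tokens for that same pointer (CUDA 5.x era; the token mechanism was later deprecated).

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int get_p2p_tokens(size_t bytes, CUdeviceptr *dptr,
                   CUDA_POINTER_ATTRIBUTE_P2P_TOKENS *tokens)
{
    void *buf = NULL;

    /* Runtime-API allocation; the driver API sees the same address space */
    if (cudaMalloc(&buf, bytes) != cudaSuccess)
        return -1;
    *dptr = (CUdeviceptr)(uintptr_t)buf;

    CUresult cr = cuPointerGetAttribute(tokens,
                                        CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                                        *dptr);
    if (cr != CUDA_SUCCESS) {
        fprintf(stderr, "cuPointerGetAttribute failed: %d\n", (int)cr);
        cudaFree(buf);
        return -1;
    }
    /* tokens->p2pToken and tokens->vaSpaceToken can now be handed to the VSL */
    return 0;
}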

VSL user space API
typedef struct fusion_iovec {
    fio_block_range_t iov_range;      ///< Range of sectors
    uint32_t iov_op;                  ///< FIOV_TRIM/WRITE/READ
    uint32_t iov_flags;               ///< FIOV_FLAGS_* (GPU)
    uint64_t iov_base;                ///< Mem addr (CPU or GPU)
    ///< Special tokens for p2p GPU transfers
    unsigned long long iov_p2p_token;
    unsigned int iov_va_space_token;
    unsigned int iov_padding_reserved;
} fusion_iovec_t;
int ufct_iovec_transfer(int fd, const struct fusion_iovec *iov, int iovcnt, uint32_t *sectors_transferred); 39

Sample Code
cuda_err = cudaMalloc((void**)&buf, allocate_size);
cu_err = cuPointerGetAttribute(&cuda_tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, (CUdeviceptr)buf);
iov[0].iov_range.base = sector;             // iomemory block addr
iov[0].iov_range.length = individual_buf_size;
iov[0].iov_op = is_write ? FIOV_WRITE : FIOV_READ;
iov[0].iov_flags = is_gpu ? FIOV_FLAGS_GPU : 0;
iov[0].iov_base = buf;                      // GPU memory addr
iov[0].iov_p2p_token = cuda_tokens.p2pToken;
iov[0].iov_va_space_token = cuda_tokens.vaSpaceToken;
ufct_iovec_transfer(fd, iov, iovcnt, &sectors_transferred); 40
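A hedged wrapper around the sample above: it assumes the tech-preview VSL header declaring fusion_iovec, the FIOV_* constants, and ufct_iovec_transfer() is available, and that a zero return means success; the length field simply mirrors the slide (a buffer size), so the exact units should be checked against the preview headers.

#include <cuda.h>
#include <stdint.h>
/* plus the tech-preview VSL header providing fusion_iovec et al. */

static int iom_gpu_transfer(int fd, uint64_t sector, uint64_t length,
                            CUdeviceptr gpu_buf,
                            const CUDA_POINTER_ATTRIBUTE_P2P_TOKENS *tok,
                            int is_write)
{
    struct fusion_iovec iov[1];
    uint32_t sectors_transferred = 0;

    iov[0].iov_range.base     = sector;            /* iomemory block address */
    iov[0].iov_range.length   = length;            /* transfer length, as on the slide */
    iov[0].iov_op             = is_write ? FIOV_WRITE : FIOV_READ;
    iov[0].iov_flags          = FIOV_FLAGS_GPU;    /* iov_base refers to GPU memory */
    iov[0].iov_base           = (uint64_t)gpu_buf; /* GPU virtual address */
    iov[0].iov_p2p_token      = tok->p2pToken;
    iov[0].iov_va_space_token = tok->vaSpaceToken;

    if (ufct_iovec_transfer(fd, iov, 1, &sectors_transferred) != 0)
        return -1;                                 /* assumed failure convention */
    return (int)sectors_transferred;
}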

Benchmark: system details: Intel Sandy Bridge, NVIDIA Tesla K20; modified CUDA bandwidth test; modified fio-iovec-transfer test. 41
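The benchmarks used modified versions of NVIDIA's bandwidth test and the fio-iovec-transfer sample; those sources are not reproduced here, but a generic timing harness in the same spirit might look like the following, where transfer_once() is a stand-in for whichever data path (staged copy or GPU Direct) is being measured.

#include <stdio.h>
#include <time.h>

double measure_bandwidth_gbs(size_t bytes, int iters,
                             int (*transfer_once)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        if (transfer_once() != 0)        /* abort on a failed transfer */
            return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbs  = (double)bytes * iters / secs / 1e9;
    printf("%.2f GB/s over %d x %zu bytes\n", gbs, iters, bytes);
    return gbs;
}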

Modified CUDA bandwidth test: [Chart: single-thread bandwidth (GB/s) for read (iomemory->GPU) and write (GPU->iomemory), comparing a hard drive, iomemory, iomemory with GPU Direct RDMA, and iomemory with GPU Direct RDMA plus lazy unpinning.] 42

Modified fio-iovec-transfer test: [Chart: bandwidth (GB/s) with 1, 2, 4, and 8 threads for read (iomemory->GPU), write (GPU->iomemory), read with GPU Direct RDMA, and write with GPU Direct RDMA.] 43

Conclusion

Conclusion: iomemory for high-performance GPU I/O is available now. The tech preview of RDMA GPU Direct for iomemory bypasses host memory to further increase I/O performance. iomemory can accelerate your GPU applications. We invite feedback: vbrisebois@fusionio.com 45

Thank You fusionio.com DELIVERING THE WORLD'S DATA. FASTER.