Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device Shuang Liang, Ranjit Noronha, D.K. Panda Department of Computer Science and Engineering The Ohio State University Email: {liangs,noronha,panda}@cse.ohio-state.edu

Presentation Outline Introduction Problem Statement Design Issues Performance Evaluation Conclusions and Future Work

Application Trends
- Applications are becoming increasingly data intensive, with high memory demand
- Larger working sets for a single application, e.g. data warehouse applications, scientific simulations
- The memory of a single node in a cluster may not accommodate the working set, while other nodes may have plenty of memory sitting unused

Utilizing Remote Memory
- Can we utilize that unused remote memory to improve application performance?
- [Figure: two nodes (CPU, MEM, NIC) exchanging pages over the network]

Motivation
- Emergence of commodity high-performance networks such as InfiniBand
  - Offloaded transport protocol stack
  - User-level communication that bypasses the OS
  - Low latency, high bandwidth comparable to local memcpy performance
  - Novel hardware features such as Remote Direct Memory Access (RDMA) with minimal host overhead
- Can we utilize these features to boost sequential data-intensive applications? And how?

Approaches
- Global memory management [Feeley95]
  - Close integration with virtual memory management
  - Implementation complexity and poor portability
- User-level run-time libraries [Koussih95]
  - Application-aware interface
  - Additional management layer in user space
- Remote paging [Markatos96]
  - Flexible
  - Moderate implementation effort

Presentation Outline Background Problem Statement Design Issues Performance Evaluation Conclusions and Future Work

Problem Statement
- Enable InfiniBand clusters to take advantage of remote memory through remote paging
  - Enhance local memory hierarchy performance
  - Deliver high performance
  - Enable applications to benefit transparently
- Evaluate the impact of network performance
  - Compare remote paging over GigE, IPoIB and native InfiniBand communication

Presentation Outline Background Problem Statement Design Issues Performance Evaluation Conclusions and Future Work

Design Choices
- Kernel-level design
  - Pros: transparent to applications; benefits all processes in the system; leverages the virtual memory system for page management
  - Cons: dependent on the OS; not easy to debug
- User-level design
  - Pros: portable across different OSes; easier to debug
  - Cons: not completely transparent to applications; benefits only applications using the user-level library; high overhead from user-level signal handling

Network Block Device
- A software mechanism for accessing remote block-based resources over the network
  - Examples: NBD, ENBD, DRBD, GNBD, etc.
  - Often used to export remote disk resources for storage, e.g. RAID devices, mirror devices
- Use a ramdisk-based network block device as the swap device
  - Seamless integration with the VM system for remote paging (see the sketch below)
- NBD, a TCP implementation of a network block device in the default kernel source tree, can be used for a comparison study
- An InfiniBand-based network block device needs to be designed
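To make the "seamless integration" point concrete, here is a minimal sketch (not from the paper) of how a ramdisk-backed network block device is handed to the Linux virtual memory system as swap. The device name /dev/hpbd0 is hypothetical, and the device is assumed to have been formatted with mkswap beforehand.

```c
/* Minimal sketch: enabling a network block device as a swap device.
 * Assumes the (hypothetical) device node /dev/hpbd0 exists and has
 * already been initialized with mkswap. Must run as root. */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    /* Hand the block device to the kernel VM as swap space; from this
     * point on, paging to remote memory is transparent to applications. */
    if (swapon("/dev/hpbd0", 0) != 0) {
        perror("swapon(/dev/hpbd0)");
        return 1;
    }
    printf("remote-memory swap device enabled\n");
    return 0;
}
```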

Architecture of the Remote Paging System
- [Figure: applications in user space page through the kernel's virtual memory management and swap device layer into HPBD, which communicates over the InfiniBand network (HCAs) with multiple memory servers; a local disk can also back the swap device]

Design Issues
- Memory registration and buffer management (a pre-registration sketch follows this slide)
  - Registration is a costly operation compared with memcpy for small buffers
  - Pre-registration outside the critical path would require registering all memory pages
  - Paging requests are bounded at 128KB in Linux
- [Figure: memcpy vs. kernel-level and user-level registration cost (microseconds) for sizes from 1B to 128KB]
- Memory copy is more than 12 times faster than memory registration for one page
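HPBD itself is a kernel driver, but the pre-registration idea can be illustrated with a small user-space libibverbs sketch (an assumption for illustration, not the paper's code): a fixed pool of copy buffers, each as large as the 128KB paging-request bound, is registered once at initialization so that no registration cost appears on the paging path. Pool size and names are illustrative.

```c
/* Sketch: pre-registering a pool of page-copy buffers at initialization,
 * so ibv_reg_mr() never appears on the paging critical path.
 * POOL_BUFS and BUF_SIZE are illustrative values, not HPBD's. */
#include <infiniband/verbs.h>
#include <stdlib.h>

#define POOL_BUFS 64
#define BUF_SIZE  (128 * 1024)   /* Linux bounds a paging request at 128KB */

struct buf_pool {
    void          *buf[POOL_BUFS];
    struct ibv_mr *mr[POOL_BUFS];
};

/* Register every buffer once, outside the critical path. */
static int pool_init(struct ibv_pd *pd, struct buf_pool *pool)
{
    for (int i = 0; i < POOL_BUFS; i++) {
        if (posix_memalign(&pool->buf[i], 4096, BUF_SIZE))
            return -1;
        pool->mr[i] = ibv_reg_mr(pd, pool->buf[i], BUF_SIZE,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);
        if (!pool->mr[i])
            return -1;
    }
    return 0;
}
```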

Design Issues (cont'd)
- Dealing with message completions
  - Polling-based synchronous completion wastes CPU cycles and is non-preemptive in kernel mode
  - InfiniBand supports event-based completion by registering an asynchronous event handler
- Thread safety
  - Multiple instances of the driver may be running; mutual exclusion is needed for shared data structures
- Reliability issues

Our Design
- RDMA-based server design (see the sketch below)
  - Page-in: the client sends a request; the memory server RDMA-Writes the page into the client buffer and returns an acknowledgement
  - Page-out: the client sends a request; the memory server RDMA-Reads the page from the client buffer and returns an acknowledgement
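The server-driven page-in path can be sketched with user-space libibverbs calls (again only an illustration of the idea; the struct page_req layout and the function name serve_page_in are assumptions): the server RDMA-Writes the requested data from its registered ramdisk directly into the client buffer advertised in the request, then finishes the exchange with an acknowledgement message (not shown).

```c
/* Sketch of the server side of a page-in request: RDMA-Write the requested
 * page(s) from the server's registered ramdisk into the client buffer named
 * in the request. page_req and serve_page_in are illustrative names. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct page_req {
    uint64_t remote_addr;   /* client buffer to write the page into */
    uint32_t rkey;          /* rkey of that client buffer           */
    uint32_t length;        /* bytes to transfer (<= 128KB)         */
    uint64_t offset;        /* offset into the server's ramdisk     */
};

static int serve_page_in(struct ibv_qp *qp, struct ibv_mr *ramdisk_mr,
                         void *ramdisk_base, const struct page_req *req)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)ramdisk_base + req->offset;
    sge.length = req->length;
    sge.lkey   = ramdisk_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)req;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* page-out would use IBV_WR_RDMA_READ */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = req->remote_addr;
    wr.wr.rdma.rkey        = req->rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```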

Our Design (cont'd)
- Registered buffer pool management
  - Pre-register a buffer pool for page copies before communication
- Hybrid completion handling (see the sketch below)
  - Register an event handler with the InfiniBand transport; both client and server block when there is no traffic
  - Use a polling scheme for bursty incoming requests
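A user-space libibverbs approximation of the hybrid completion scheme (assumed names, for illustration only): the completion thread sleeps on the completion channel while the device is idle, then drains the CQ by polling so a burst of paging requests costs only one wakeup.

```c
/* Sketch of hybrid completion handling: block on the completion channel
 * when idle, then poll the CQ dry to absorb bursts of completions. */
#include <infiniband/verbs.h>

/* Placeholder for HPBD's actual page-in/page-out completion logic. */
static void handle_completion(struct ibv_wc *wc)
{
    (void)wc;
}

static void completion_loop(struct ibv_comp_channel *channel)
{
    struct ibv_cq *cq;
    void *ctx;
    struct ibv_wc wc;

    for (;;) {
        /* Sleep until the HCA raises a completion event (no traffic, no CPU). */
        if (ibv_get_cq_event(channel, &cq, &ctx))
            break;
        ibv_ack_cq_events(cq, 1);

        /* Re-arm event notification, then poll to pick up every completion
         * that is already queued, including ones that raced the wakeup. */
        ibv_req_notify_cq(cq, 0);
        while (ibv_poll_cq(cq, 1, &wc) > 0)
            handle_completion(&wc);
    }
}
```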

Our Design (cont'd)
- Reliable communication
  - Use InfiniBand Reliable Connection (RC) services
- Flow control
  - Use credit-based flow control
- Multiple server support
  - Distribute blocks across multiple servers in linear mode (see the sketch below)
- [Figure: swap area blocks 1..n on Server 1, n+1..2n on Server 2, and so on]
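In linear mode each memory server exports a fixed-size slice of the swap area, so locating a block is a single division. The tiny helper below (illustrative names, not the paper's code) shows the mapping.

```c
/* Sketch of linear-mode block distribution: blocks 0..n-1 live on server 0,
 * n..2n-1 on server 1, and so on. */
#include <stdint.h>

struct server_loc {
    unsigned server;        /* which memory server holds the block     */
    uint64_t local_block;   /* block number within that server's slice */
};

static struct server_loc map_block(uint64_t block, uint64_t blocks_per_server)
{
    struct server_loc loc;

    loc.server      = (unsigned)(block / blocks_per_server);
    loc.local_block = block % blocks_per_server;
    return loc;
}
```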

Presentation Outline Background Problem Statement Design Issues Performance Evaluation Conclusions and Future Work

Experiment Setup
- Xeon 2.66GHz cluster with 2GB DDR memory, 40GB ST340014A hard disk, and Mellanox MT23108 InfiniBand HCA
- Memory size configuration
  - 2GB for the local-memory test scenario
  - 512MB for the swapping scenario
- Swap area setup
  - Use a ramdisk on the memory server as the swap area

Latency Comparison
- [Figure: latency (microseconds) vs. message size (1B to 64KB) for memcpy, native IB, IPoIB and GigE]
- For one page, native InfiniBand communication latency is 4 times lower than IPoIB, 8 times lower than GigE, and 2.4 times higher than memcpy

Micro-benchmark: Execution Time
- [Figure: execution time (seconds) for Memory, HPBD, NBD-IPoIB, NBD-GigE and Local-Disk configurations]
- Network overhead is approximately 30% for IPoIB; using server-based RDMA further improves performance for HPBD

Quicksort Execution Time
- [Figure: execution time (seconds) for Memory, HPBD, NBD-IPoIB, NBD-GigE and Local-Disk configurations]
- HPBD is 1.4 times slower than running with enough local memory and 4.7 times faster than swapping to disk

Barnes Execution Time
- [Figure: execution time (seconds) for Memory, HPBD, NBD-IPoIB, NBD-GigE and Local-Disk configurations]
- For a slightly oversized working set, HPBD is still 1.4 times faster than swapping to disk

Two Processes of Quicksort
- [Figure: execution time (seconds) of process 1 and process 2 for: 512MB memory with disk swapping; 512MB memory with 3x512MB HPBD servers; 1GB memory with 2x512MB HPBD servers; 2GB memory]
- Concurrent instances of quicksort run up to 21 times faster than swapping to disk

Quicksort with Multiple Servers
- [Figure: execution time (seconds) with 1 server over GigE, 1 server over IPoIB, and 1, 2, 4, 8 and 16 HPBD servers]
- Maintaining multiple connections does not degrade performance with up to 12 servers

Presentation Outline Background Problem Statement Design Issues Performance Evaluation Conclusions and Future Work

Conclusions
- Remote paging is an efficient way to enable sequential applications to take advantage of remote memory
- Using InfiniBand for remote paging improves performance compared with GigE and IPoIB, and is comparable to a system with enough local memory
- As network speed increases, host overhead becomes more critical for further performance improvement

Future Work
- Achieve zero copy along the communication path to reduce host overhead on the critical path
- Dynamic management of idle cluster memory for swap area allocation

Acknowledgements
- Our research is supported by the following organizations
  - Current funding support by [sponsor logos]
  - Current equipment donations by [sponsor logos]

Thank You! {liangs, noronha, panda}@cse.ohio-state.edu Network-Based Computing Laboratory http://nowlab.cis.ohio-state.edu/