Tyche: An efficient Ethernet-based protocol for converged networked storage
Pilar González-Férez and Angelos Bilas
30th International Conference on Massive Storage Systems and Technology (MSST 2014)
June 6, Santa Clara, California
Outline
1 Introduction
2 Design
3 Results
4 Conclusions and Future Directions
Efficient access to networked storage
Public clouds use shared storage: lower cost, and easier to support migration and other operations
Converged storage places low-latency storage devices in all servers
Storage requests are exchanged between all compute servers
The network protocol is important for achieving high I/O throughput
Modern servers increase the number of cores and NICs
The cost to access storage is a concern as well: we cannot use custom NICs or controllers in all servers
Ethernet is the dominant technology for datacenters
Lower cost and complexity: a single Ethernet network for storage and network data
How can we reduce protocol overheads for accessing remote storage over Ethernet?
Efficient access to networked storage (ii)
Challenges:
Synchronization from 10s of cores to a single link
Link bundling for spatial parallelism
NUMA affinity
Dynamic assignment of links to cores
Our goal: design a networked storage access protocol that dynamically manages cores, NICs, and NUMA affinity
Our Proposal
Tyche: a network storage protocol that efficiently shares remote resources by transparently using several NICs and connections
Design goals:
Connection-oriented protocol
Edge-based communication subsystem
Uses Ethernet: provides RDMA-type operations without any hardware support, and can be deployed in existing infrastructures
Creates a block device: a local view of a remote storage device, supporting any existing file system
Overview
[Figure: send path (initiator) and receive path (target). In kernel space, the initiator stack is VFS, file system, block device, Tyche block layer, Tyche network layer, and Ethernet driver; the target mirrors it down to the storage device. The network layer sits directly above the physical devices.]
Design Challenges
Efficiently map I/O requests to network messages
Memory management
NUMA affinity
Synchronization
Allow high concurrency to saturate many NICs
Map I/O Requests to Network Messages
Network messages:
Request/completion messages carry I/O requests and completions
A request message corresponds to a single request packet
Request packets are transferred as small Ethernet frames (< 100 bytes)
Data messages carry data pages: RDMA-like operations on a scatter-gather list of memory pages
Data packets are transferred as jumbo Ethernet frames
Zero copy: avoid data copies in the receive path
For writes, received pages are interchanged with Tyche pages
For reads, the interchange cannot be applied
The Ethernet header carries information about packets/messages, provides end-to-end flow control, and facilitates communication between the block layer and the network layer
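The packet header described above might be sketched as follows. This is an illustrative layout, not the actual wire format: the field names, sizes, and the 4 KB/64-byte payload split are assumptions based on the slide's description of small request frames and jumbo data frames.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Tyche packet header; fields are illustrative. */
enum pkt_type { PKT_REQUEST = 1, PKT_COMPLETION = 2, PKT_DATA = 3 };

struct tyche_hdr {
    uint16_t type;     /* request / completion / data             */
    uint16_t ring_pos; /* pre-agreed position in the peer rx_ring */
    uint32_t msg_id;   /* ties data packets to their request      */
    uint32_t seq;      /* packet index within a data message      */
    uint32_t credits;  /* end-to-end flow-control credits         */
};

/* Request packets travel as small frames (< 100 bytes of payload);
 * data packets travel as jumbo frames carrying whole pages. */
static inline size_t frame_payload(uint16_t type)
{
    return type == PKT_DATA ? 4096 : 64;
}
```

Carrying the ring position and flow-control credits in every header is what lets the block and network layers coordinate without extra control messages.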
Memory Management Overhead
Block layer:
remq: queue of pre-allocated request messages; requests and completions use the same message buffers
damq: queue of pre-allocated descriptors for data messages
The target uses pre-allocated pages, avoiding alloc/free on the I/O path
The initiator uses the pages of the regular I/O requests
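A minimal sketch of the pre-allocation idea behind remq/damq: a fixed pool of message buffers handed out from a free list, so the I/O path never calls malloc. The slot count and buffer size are illustrative, not Tyche's actual values.

```c
#include <stddef.h>

#define POOL_SLOTS 64
#define MSG_BYTES  256

/* Fixed pool of pre-allocated message buffers (remq-style sketch). */
struct msg_pool {
    char buf[POOL_SLOTS][MSG_BYTES];
    int  free_list[POOL_SLOTS];  /* stack of free slot indices */
    int  top;                    /* number of free slots       */
};

static void pool_init(struct msg_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        p->free_list[i] = i;
    p->top = POOL_SLOTS;
}

/* Returns a pre-allocated buffer, or NULL when the pool is empty;
 * a real implementation would apply back-pressure, not allocate. */
static char *pool_get(struct msg_pool *p)
{
    return p->top > 0 ? p->buf[p->free_list[--p->top]] : NULL;
}

static void pool_put(struct msg_pool *p, char *m)
{
    p->free_list[p->top++] = (int)((m - &p->buf[0][0]) / MSG_BYTES);
}
```

Because requests and completions share the same buffers, a completion can reuse the slot its request arrived in, keeping the pool's footprint bounded.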
NUMA Affinity
Maximum throughput is achieved only with the right placement
One logical connection per NIC
Per-connection resources (remq, damq, and the private rings tx_ring, rx_ring, not_ring) are allocated on the NUMA node where the NIC is attached
The connection is selected depending on the location of the buffers of the user's I/O requests
[Figure: two-socket topology with Memory 0/1, Processors 0/1 (cores 0-7), QPI links, I/O hubs 0/1, and NICs on PCIe x8 slots.]
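A toy sketch of the connection-selection policy: pick a NIC that sits on the same NUMA node as the request's buffer. The NIC-to-node table, the address-to-node lookup, and the round-robin tie-break are all assumptions for illustration; in the kernel the node would come from the physical page of the I/O buffer.

```c
#define NR_NODES 2
#define NR_NICS  6

/* Assumed topology: three NICs behind each I/O hub / NUMA node. */
static const int nic_node[NR_NICS] = { 0, 0, 0, 1, 1, 1 };

/* Stand-in for a real page-to-node lookup (toy policy). */
static int node_of_buffer(unsigned long phys_addr)
{
    return (int)((phys_addr >> 30) % NR_NODES);
}

/* Round-robin among the NICs attached to the buffer's node. */
static int pick_connection(unsigned long phys_addr, unsigned *rr)
{
    int node = node_of_buffer(phys_addr);
    for (int i = 0; i < NR_NICS; i++) {
        int nic = (int)((*rr)++ % NR_NICS);
        if (nic_node[nic] == node)
            return nic;
    }
    return 0; /* fallback: any NIC */
}
```

Keeping the buffer, the connection's rings, and the NIC on one node avoids the cross-QPI traffic that the affinity results later quantify.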
Tyche Overview
[Figure: the earlier overview, now showing per-connection resources: remq and damq in the Tyche block layer; tx_ring_small, tx_ring_big, not_ring_req, not_ring_data, rx_ring_small, and rx_ring_big in the Tyche network layer.]
Synchronization Overhead
Context synchronization is reduced for shared structures
Each connection has its own private resources
Network layer: three logical rings
tx_ring: transmission ring
rx_ring: receive ring
not_ring: notification ring
For each logical ring, two different physical rings:
A small ring for request packets
A large ring for data packets
Each physical ring has only two sync variables: head and tail
The initiator specifies fixed positions at remq and damq
For each packet, the sender specifies its position in the rx_rings
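A minimal sketch of a ring with only head and tail as sync variables, in the spirit of the tx/rx/not rings: the producer writes only head, the consumer writes only tail, so a single-producer/single-consumer pair needs no lock. The size and element type are illustrative, and a kernel version would add memory barriers around the index updates.

```c
#include <stdbool.h>

#define RING_SLOTS 8   /* power of two */

struct ring {
    unsigned head;            /* written by the producer only */
    unsigned tail;            /* written by the consumer only */
    int      slot[RING_SLOTS];
};

static bool ring_push(struct ring *r, int v)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;                    /* full */
    r->slot[r->head % RING_SLOTS] = v;
    r->head++;                           /* publish */
    return true;
}

static bool ring_pop(struct ring *r, int *v)
{
    if (r->head == r->tail)
        return false;                    /* empty */
    *v = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}
```

Letting the sender pre-specify each packet's ring position extends this idea across the wire: the receiver can place an incoming packet without negotiating a slot.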
Synchronization Overhead (ii)
[Figure: send and receive paths through the block layer, network layer, and Ethernet driver, annotating which structures need a lock (L) or an atomic (A): remq, damq, the request and data messages, tx_ring_small/tx_ring_big, not_ring_req/not_ring_data, and rx_ring_small/rx_ring_big.]
Synchronization Overhead (iii)
Many threads simultaneously issuing write requests cause lock synchronization overhead and lock contention at the NIC level
Two modes of operation:
Inline mode: the application context issues requests with no context switch
Queue mode: applications insert I/O requests into a Tyche queue, and several threads submit the network requests
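The trade-off between the two modes can be sketched as a dispatcher. The 64 kB threshold and the thread-count check are assumptions chosen only to illustrate the trade-off the measurements show later (inline wins at 4 kB, queue wins for large concurrent writes); the talk does not prescribe a fixed cut-off.

```c
#include <stddef.h>

enum submit_mode { MODE_INLINE, MODE_QUEUE };

/* Heuristic sketch: pick a submission mode per request
 * (threshold and policy are illustrative assumptions). */
static enum submit_mode choose_mode(size_t req_bytes, int concurrent_threads)
{
    /* Small requests: issue inline, in the application's context,
     * avoiding the context-switch cost of queue mode. */
    if (req_bytes <= 64 * 1024)
        return MODE_INLINE;
    /* Large requests under concurrency: hand off to a Tyche queue
     * so only a few threads contend for the NIC-level locks. */
    return concurrent_threads > 1 ? MODE_QUEUE : MODE_INLINE;
}
```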
Allow High Concurrency to Saturate Many NICs
Tyche scales with load at the initiator and the target
Send path:
The initiator uses queue mode: multiple threads place requests in a queue, and Tyche controls the number of threads accessing each link
The target uses work queues to send I/O completions back, with one work-queue thread per physical core
Receive path:
Network layer: one thread per NIC processes incoming data
Block layer: several threads per NIC issue/complete requests
Tested up to 6 x 10 Gbits/s
Experimental Testbed
Hardware & software:
Two nodes, each with a 4-core Intel Xeon E5520 @ 2.7 GHz
Initiator: 12 GB DDR-III DRAM
Target: 48 GB DDR-III DRAM, 36 GB used as a ramdisk
6 Myri10ge cards per node, connected back to back
CentOS 6.3, Linux kernel 2.6.32
Benchmarks: zmio, FIO, HBase+YCSB, Psearchy, Blast, ...
Tyche compared to:
NBD: the Linux Network Block Device
TSockets: the Tyche block layer over the TCP/IP protocol
Baseline Performance
zmio, 32 threads, raw device (no file system), 1 MB request size
Tyche throughput scales with the number of NICs
Tyche achieves between 82% and 92% of the maximum link throughput
Tyche achieves around 10x the throughput of NBD
[Figure: throughput (GB/s) vs. number of NICs (1-6) for Tyche, TSockets, and NBD; left panel read requests, right panel write requests.]
Impact of Affinity
zmio, 32 threads, raw device (no file system), 1 MB request size
Tyche achieves maximum throughput only with the right placement:
Full-mem placement improves no-affinity performance by up to 97%
Kmem-NIC placement improves no-affinity performance by up to 54%
[Figure: throughput (GB/s) vs. number of NICs (1-6) for no affinity, Kmem-NIC, and Full-mem; left panel read requests, right panel write requests.]
Receive Path Scaling
zmio, 32 threads, raw device, 4 kB, 64 kB, and 1 MB request sizes
A single thread can process requests for three NICs: 30 Gbits/s
By using a thread per NIC:
Maximum throughput can be achieved
Receive-path synchronization is reduced
[Figure: throughput (GB/s) vs. number of NICs (1-6) for single-thread vs. multi-thread at 4 kB, 64 kB, and 1 MB; left panel read requests, right panel write requests.]
Send Path Scaling
FIO, XFS, 256 MB file size, several threads, each with its own file
4 kB requests: queue mode incurs context switches
Inline mode outperforms queue mode by up to 31%
512 kB requests: inline mode incurs synchronization overhead and lock contention
Writes: queue mode outperforms inline mode by up to 45%
[Figure: throughput (GB/s) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel 4 kB requests, right panel 512 kB requests.]
Queue vs. Inline Mode Overhead: 4 kB
Queue mode pays the context-switch overhead
Initiator: CPU utilization increases by up to 29%
Target: lower throughput; CPU utilization drops by up to 19%
[Figure: CPU utilization (sys + user, %) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel initiator, right panel target, 4 kB request size.]
Queue vs. Inline Mode Overhead: 512 kB
Writes: inline mode incurs synchronization overhead and lock contention
Initiator: CPU utilization increases by up to 30%
Target: lower throughput; CPU utilization drops by up to 40%
[Figure: CPU utilization (sys + user, %) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel initiator, right panel target, 512 kB request size.]
Other Benchmarks
Tyche always performs better than NBD and TSockets

Throughput (MB/s):

Benchmark    | Tyche (1 NIC) | Tyche (6 NICs) | NBD (1 NIC) | TSockets (1 NIC) | TSockets (6 NICs)
Psearchy     |         1,154 |          4,117 |         499 |              488 |             1,724
Blast        |           775 |            882 |         438 |              391 |               564
IOR-R 512k   |           573 |          1,670 |         212 |              226 |               745
IOR-W 512k   |           603 |          1,670 |         230 |              243 |               751
HBase-Read   |           303 |            295 |         154 |              168 |               229
HBase-Insert |           106 |            112 |          99 |               54 |                92
Conclusions and Future Work
Conclusions:
Tyche, a networked storage protocol, transparently uses multiple NICs and multiple connections
It addresses contention, memory management, and network ordering
It addresses NUMA affinity issues
It achieves scalable throughput:
Reads: up to 6.4 GBytes/s (~7 max)
Writes: up to 6.7 GBytes/s (~7 max)
It significantly outperforms NBD and TSockets
Future directions:
Consider how Tyche can co-exist with other network protocols over Ethernet
Tyche: An efficient Ethernet-based protocol for converged networked storage
Pilar González-Férez and Angelos Bilas
pilar@ditec.um.es bilas@ics.forth.gr
FP7-ICT-610509
Send Path Overview
[Figure: numbered steps through the block layer, network layer, and Ethernet driver on the send path, using remq/damq, the request and data messages, and tx_ring_small/tx_ring_big; left panel write requests, right panel read requests.]
Receive Path Overview
[Figure: the receive path through the Ethernet driver, network layer, and block layer, using rx_ring_small/rx_ring_big, not_ring_req/not_ring_data, the request and data messages, and remq/damq; left panel write requests, right panel read requests.]