Tyche: An efficient Ethernet-based protocol for converged networked storage
Pilar González-Férez and Angelos Bilas
30th International Conference on Massive Storage Systems and Technology (MSST 2014)
June 6, Santa Clara, California
Outline
1 Introduction
2 Design
3 Results
4 Conclusions and Future Directions
Efficient access to networked storage
Public clouds use shared storage: lower cost, and easier to support migration and other operations
Converged storage places low-latency storage devices in all servers
Storage requests are exchanged between all compute servers
The network protocol is important for achieving high I/O throughput
Modern servers increase the number of cores and NICs
The cost to access storage is a concern as well: we cannot use custom NICs or controllers in all servers
Ethernet is the dominant technology for datacenters
Lower cost and complexity: a single Ethernet network for storage and network data
How can we reduce protocol overheads for accessing remote storage over Ethernet?
Efficient access to networked storage (ii)
Challenges:
Synchronization from 10s of cores to a single link
Link bundling for spatial parallelism
NUMA affinity
Dynamic assignment of links to cores
Our goal: design a networked storage access protocol that dynamically manages cores, NICs, and NUMA affinity
Our Proposal
Tyche: a network storage protocol that efficiently shares remote resources by transparently using several NICs and connections
Design goals:
Connection-oriented protocol
Edge-based communication subsystem
Uses Ethernet: provides RDMA-type operations without any hardware support, and can be deployed in existing infrastructures
Creates a block device: a local view of a remote storage device, supporting any existing file system
Overview
[Figure: send path (initiator) and receive path (target). In kernel space, the initiator stack is VFS, file system, block device, Tyche block layer, Tyche network layer, and Ethernet driver; the target mirrors it down to the storage device. The network layer sits directly above the physical devices.]
Design Challenges
Efficiently map I/O requests to network messages
Memory management
NUMA affinity
Synchronization
Allow high concurrency to saturate many NICs
Map I/O Requests to Network Messages
Network messages:
Request/completion messages carry I/O requests and completions
A request message corresponds to a single request packet
Request packets are transferred as small Ethernet frames (< 100 bytes)
Data messages carry data pages: RDMA-like operations on a scatter-gather list of memory pages
Data packets are transferred as jumbo Ethernet frames
Zero copy: avoid data copies in the receive path
For writes, received pages are interchanged with Tyche pages
For reads, the interchange cannot be applied
The Ethernet header carries information about packets/messages, provides end-to-end flow control, and facilitates communication between the block layer and the network layer
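The packet header described above might be sketched as follows. This is an illustrative layout, not the actual wire format: the field names, sizes, and the 4 KB/64-byte payload split are assumptions based on the slide's description of small request frames and jumbo data frames.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Tyche packet header; fields are illustrative. */
enum pkt_type { PKT_REQUEST = 1, PKT_COMPLETION = 2, PKT_DATA = 3 };

struct tyche_hdr {
    uint16_t type;     /* request / completion / data             */
    uint16_t ring_pos; /* pre-agreed position in the peer rx_ring */
    uint32_t msg_id;   /* ties data packets to their request      */
    uint32_t seq;      /* packet index within a data message      */
    uint32_t credits;  /* end-to-end flow-control credits         */
};

/* Request packets travel as small frames (< 100 bytes of payload);
 * data packets travel as jumbo frames carrying whole pages. */
static inline size_t frame_payload(uint16_t type)
{
    return type == PKT_DATA ? 4096 : 64;
}
```

Carrying the ring position and flow-control credits in every header is what lets the block and network layers coordinate without extra control messages.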
Memory Management Overhead
Block layer:
remq: queue of pre-allocated request messages; requests and completions use the same message buffers
damq: queue of pre-allocated descriptors for data messages
The target uses pre-allocated pages, avoiding alloc/free on the I/O path
The initiator uses the pages of the regular I/O requests
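A minimal sketch of the pre-allocation idea behind remq/damq: a fixed pool of message buffers handed out from a free list, so the I/O path never calls malloc. The slot count and buffer size are illustrative, not Tyche's actual values.

```c
#include <stddef.h>

#define POOL_SLOTS 64
#define MSG_BYTES  256

/* Fixed pool of pre-allocated message buffers (remq-style sketch). */
struct msg_pool {
    char buf[POOL_SLOTS][MSG_BYTES];
    int  free_list[POOL_SLOTS];  /* stack of free slot indices */
    int  top;                    /* number of free slots       */
};

static void pool_init(struct msg_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        p->free_list[i] = i;
    p->top = POOL_SLOTS;
}

/* Returns a pre-allocated buffer, or NULL when the pool is empty;
 * a real implementation would apply back-pressure, not allocate. */
static char *pool_get(struct msg_pool *p)
{
    return p->top > 0 ? p->buf[p->free_list[--p->top]] : NULL;
}

static void pool_put(struct msg_pool *p, char *m)
{
    p->free_list[p->top++] = (int)((m - &p->buf[0][0]) / MSG_BYTES);
}
```

Because requests and completions share the same buffers, a completion can reuse the slot its request arrived in, keeping the pool's footprint bounded.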
NUMA Affinity
Maximum throughput is achieved only with the right placement
One logical connection per NIC
Per-connection resources (remq, damq, and the private rings tx_ring, rx_ring, not_ring) are allocated on the NUMA node where the NIC is attached
The connection is selected depending on the location of the buffers of the user's I/O requests
[Figure: two-socket topology with Memory 0/1, Processors 0/1 (cores 0-7), QPI links, I/O hubs 0/1, and NICs on PCIe x8 slots.]
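A toy sketch of the connection-selection policy: pick a NIC that sits on the same NUMA node as the request's buffer. The NIC-to-node table, the address-to-node lookup, and the round-robin tie-break are all assumptions for illustration; in the kernel the node would come from the physical page of the I/O buffer.

```c
#define NR_NODES 2
#define NR_NICS  6

/* Assumed topology: three NICs behind each I/O hub / NUMA node. */
static const int nic_node[NR_NICS] = { 0, 0, 0, 1, 1, 1 };

/* Stand-in for a real page-to-node lookup (toy policy). */
static int node_of_buffer(unsigned long phys_addr)
{
    return (int)((phys_addr >> 30) % NR_NODES);
}

/* Round-robin among the NICs attached to the buffer's node. */
static int pick_connection(unsigned long phys_addr, unsigned *rr)
{
    int node = node_of_buffer(phys_addr);
    for (int i = 0; i < NR_NICS; i++) {
        int nic = (int)((*rr)++ % NR_NICS);
        if (nic_node[nic] == node)
            return nic;
    }
    return 0; /* fallback: any NIC */
}
```

Keeping the buffer, the connection's rings, and the NIC on one node avoids the cross-QPI traffic that the affinity results later quantify.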
Tyche Overview
[Figure: the earlier overview, now showing per-connection resources: remq and damq in the Tyche block layer; tx_ring_small, tx_ring_big, not_ring_req, not_ring_data, rx_ring_small, and rx_ring_big in the Tyche network layer.]
Synchronization Overhead
Context synchronization is reduced for shared structures
Each connection has its own private resources
Network layer: three logical rings
tx_ring: transmission ring
rx_ring: receive ring
not_ring: notification ring
For each logical ring, two different physical rings:
A small ring for request packets
A large ring for data packets
Each physical ring has only two sync variables: head and tail
The initiator specifies fixed positions at remq and damq
For each packet, the sender specifies its position in the rx_rings
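A minimal sketch of a ring with only head and tail as sync variables, in the spirit of the tx/rx/not rings: the producer writes only head, the consumer writes only tail, so a single-producer/single-consumer pair needs no lock. The size and element type are illustrative, and a kernel version would add memory barriers around the index updates.

```c
#include <stdbool.h>

#define RING_SLOTS 8   /* power of two */

struct ring {
    unsigned head;            /* written by the producer only */
    unsigned tail;            /* written by the consumer only */
    int      slot[RING_SLOTS];
};

static bool ring_push(struct ring *r, int v)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;                    /* full */
    r->slot[r->head % RING_SLOTS] = v;
    r->head++;                           /* publish */
    return true;
}

static bool ring_pop(struct ring *r, int *v)
{
    if (r->head == r->tail)
        return false;                    /* empty */
    *v = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}
```

Letting the sender pre-specify each packet's ring position extends this idea across the wire: the receiver can place an incoming packet without negotiating a slot.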
Synchronization Overhead (ii)
[Figure: send and receive paths through the block layer, network layer, and Ethernet driver, annotating which structures need a lock (L) or an atomic (A): remq, damq, the request and data messages, tx_ring_small/tx_ring_big, not_ring_req/not_ring_data, and rx_ring_small/rx_ring_big.]
Synchronization Overhead (iii)
Many threads simultaneously issuing write requests cause lock synchronization overhead and lock contention at the NIC level
Two modes of operation:
Inline mode: the application context issues requests with no context switch
Queue mode: applications insert I/O requests into a Tyche queue, and several threads submit the network requests
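The trade-off between the two modes can be sketched as a dispatcher. The 64 kB threshold and the thread-count check are assumptions chosen only to illustrate the trade-off the measurements show later (inline wins at 4 kB, queue wins for large concurrent writes); the talk does not prescribe a fixed cut-off.

```c
#include <stddef.h>

enum submit_mode { MODE_INLINE, MODE_QUEUE };

/* Heuristic sketch: pick a submission mode per request
 * (threshold and policy are illustrative assumptions). */
static enum submit_mode choose_mode(size_t req_bytes, int concurrent_threads)
{
    /* Small requests: issue inline, in the application's context,
     * avoiding the context-switch cost of queue mode. */
    if (req_bytes <= 64 * 1024)
        return MODE_INLINE;
    /* Large requests under concurrency: hand off to a Tyche queue
     * so only a few threads contend for the NIC-level locks. */
    return concurrent_threads > 1 ? MODE_QUEUE : MODE_INLINE;
}
```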
Allow High Concurrency to Saturate Many NICs
Tyche scales with load at the initiator and the target
Send path:
The initiator uses queue mode: multiple threads place requests in a queue, and Tyche controls the number of threads accessing each link
The target uses work queues to send I/O completions back, with one work-queue thread per physical core
Receive path:
Network layer: one thread per NIC processes incoming data
Block layer: several threads per NIC issue/complete requests
Tested up to 6 x 10 Gbits/s
Experimental Testbed
Hardware & software:
Two nodes, each with a 4-core Intel Xeon E5520 @ 2.7 GHz
Initiator: 12 GB DDR-III DRAM
Target: 48 GB DDR-III DRAM, 36 GB used as a ramdisk
6 Myri10ge cards per node, connected back to back
CentOS 6.3, Linux kernel 2.6.32
Benchmarks: zmio, FIO, HBase+YCSB, Psearchy, Blast, ...
Tyche compared to:
NBD: the Linux Network Block Device
TSockets: the Tyche block layer over the TCP/IP protocol
Baseline Performance
zmio, 32 threads, raw device (no file system), 1 MB request size
Tyche throughput scales with the number of NICs
Tyche achieves between 82% and 92% of the maximum link throughput
Tyche achieves around 10x the throughput of NBD
[Figure: throughput (GB/s) vs. number of NICs (1-6) for Tyche, TSockets, and NBD; left panel read requests, right panel write requests.]
Impact of Affinity
zmio, 32 threads, raw device (no file system), 1 MB request size
Tyche achieves maximum throughput only with the right placement:
Full-mem placement improves no-affinity performance by up to 97%
Kmem-NIC placement improves no-affinity performance by up to 54%
[Figure: throughput (GB/s) vs. number of NICs (1-6) for no affinity, Kmem-NIC, and Full-mem; left panel read requests, right panel write requests.]
Receive Path Scaling
zmio, 32 threads, raw device, 4 kB, 64 kB, and 1 MB request sizes
A single thread can process requests for three NICs: 30 Gbits/s
By using a thread per NIC:
Maximum throughput can be achieved
Receive-path synchronization is reduced
[Figure: throughput (GB/s) vs. number of NICs (1-6) for single-thread vs. multi-thread at 4 kB, 64 kB, and 1 MB; left panel read requests, right panel write requests.]
Send Path Scaling
FIO, XFS, 256 MB file size, several threads, each with its own file
4 kB requests: queue mode incurs context switches
Inline mode outperforms queue mode by up to 31%
512 kB requests: inline mode incurs synchronization overhead and lock contention
Writes: queue mode outperforms inline mode by up to 45%
[Figure: throughput (GB/s) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel 4 kB requests, right panel 512 kB requests.]
Queue vs. Inline Mode Overhead: 4 kB
Queue mode pays the context-switch overhead
Initiator: CPU utilization increases by up to 29%
Target: lower throughput; CPU utilization drops by up to 19%
[Figure: CPU utilization (sys + user, %) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel initiator, right panel target, 4 kB request size.]
Queue vs. Inline Mode Overhead: 512 kB
Writes: inline mode incurs synchronization overhead and lock contention
Initiator: CPU utilization increases by up to 30%
Target: lower throughput; CPU utilization drops by up to 40%
[Figure: CPU utilization (sys + user, %) vs. number of threads (4-128) for reads and writes in queue and inline modes; left panel initiator, right panel target, 512 kB request size.]
Other Benchmarks
Tyche always performs better than NBD and TSockets

Throughput (MB/s):

Benchmark    | Tyche (1 NIC) | Tyche (6 NICs) | NBD (1 NIC) | TSockets (1 NIC) | TSockets (6 NICs)
Psearchy     |         1,154 |          4,117 |         499 |              488 |             1,724
Blast        |           775 |            882 |         438 |              391 |               564
IOR-R 512k   |           573 |          1,670 |         212 |              226 |               745
IOR-W 512k   |           603 |          1,670 |         230 |              243 |               751
HBase-Read   |           303 |            295 |         154 |              168 |               229
HBase-Insert |           106 |            112 |          99 |               54 |                92
Conclusions and Future Work
Conclusions:
Tyche, a networked storage protocol, transparently uses multiple NICs and multiple connections
It addresses contention, memory management, and network ordering
It addresses NUMA affinity issues
It achieves scalable throughput:
Reads: up to 6.4 GBytes/s (~7 max)
Writes: up to 6.7 GBytes/s (~7 max)
It significantly outperforms NBD and TSockets
Future directions:
Consider how Tyche can co-exist with other network protocols over Ethernet
Tyche: An efficient Ethernet-based protocol for converged networked storage
Pilar González-Férez and Angelos Bilas
pilar@ditec.um.es bilas@ics.forth.gr
FP7-ICT-610509
Send Path Overview
[Figure: numbered steps through the block layer, network layer, and Ethernet driver on the send path, using remq/damq, the request and data messages, and tx_ring_small/tx_ring_big; left panel write requests, right panel read requests.]
Receive Path Overview
[Figure: the receive path through the Ethernet driver, network layer, and block layer, using rx_ring_small/rx_ring_big, not_ring_req/not_ring_data, the request and data messages, and remq/damq; left panel write requests, right panel read requests.]