Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Similar documents

Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

D1.2 Network Load Balancing

An Architectural study of Cluster-Based Multi-Tier Data-Centers

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003

Evaluation Report: Emulex OCe GbE and OCe GbE Adapter Comparison with Intel X710 10GbE and XL710 40GbE Adapters

RDMA over Ethernet - A Preliminary Study

Gigabit Ethernet Design

How To Test A Microsoft Vxworks Vx Works (Vxworks) And Vxwork (Vkworks) (Powerpc) (Vzworks)

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Collecting Packet Traces at High Speed

Performance Characterization of a 10-Gigabit Ethernet TOE

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

VMWARE WHITE PAPER 1

High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand

I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology

Isolating the Performance Impacts of Network Interface Cards through Microbenchmarks

Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects

Lustre Networking BY PETER J. BRAAM

InfiniBand Software and Protocols Enable Seamless Off-the-shelf Applications Deployment

PCI Express* Ethernet Networking

Boosting Data Transfer with TCP Offload Engine Technology

Putting it on the NIC: A Case Study on application offloading to a Network Interface Card (NIC)

Implementation and Performance Evaluation of M-VIA on AceNIC Gigabit Ethernet Card

Advanced Computer Networks. High Performance Networking I

High Speed I/O Server Computing with InfiniBand

Building Enterprise-Class Storage Using 40GbE

End-System Optimizations for High-Speed TCP

Networking Driver Performance and Measurement - e1000 A Case Study

Linux NIC and iscsi Performance over 40GbE

Getting the most TCP/IP from your Embedded Processor

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

The Advantages of Multi-Port Network Adapters in an SWsoft Virtual Environment

Cluster Grid Interconects. Tony Kay Chief Architect Enterprise Grid and Networking

Performance and Recommended Use of AB545A 4-Port Gigabit Ethernet Cards

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Network Performance Optimisation and Load Balancing. Wulf Thannhaeuser

Telecom - The technology behind

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Why Compromise? A discussion on RDMA versus Send/Receive and the difference between interconnect and application semantics

Where IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine

Optimizing TCP Forwarding

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking

PCI Express High Speed Networks. Complete Solution for High Speed Networking

Building High-Performance iscsi SAN Configurations. An Alacritech and McDATA Technical Note

Performance of Software Switching

Network Virtualization Technologies and their Effect on Performance

Intel DPDK Boosts Server Appliance Performance White Paper

IP Storage: The Challenge Ahead Prasenjit Sarkar, Kaladhar Voruganti Abstract 1 Introduction

Increasing Web Server Throughput with Network Interface Data Caching

Optimizing Network Virtualization in Xen

IEEE Congestion Management Presentation for IEEE Congestion Management Study Group

Storage at a Distance; Using RoCE as a WAN Transport

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

ECLIPSE Performance Benchmarks and Profiling. January 2009

Introduction to Intel Ethernet Flow Director and Memcached Performance

Performance of Host Identity Protocol on Nokia Internet Tablet

VXLAN Performance Evaluation on VMware vsphere 5.1

Introduction to PCI Express Positioning Information

Tyche: An efficient Ethernet-based protocol for converged networked storage

TCP Performance Re-Visited

Accelerating From Cluster to Cloud: Overview of RDMA on Windows HPC. Wenhao Wu Program Manager Windows HPC team

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

PCI Technology Overview

Design and Implementation of the iwarp Protocol in Software. Dennis Dalessandro, Ananth Devulapalli, Pete Wyckoff Ohio Supercomputer Center

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

Network Performance in High Performance Linux Clusters

Architectural Breakdown of End-to-End Latency in a TCP/IP Network

1-Gigabit TCP Offload Engine

Cut I/O Power and Cost while Boosting Blade Server Performance

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

Networking Virtualization Using FPGAs

1000Mbps Ethernet Performance Test Report

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Performance Analysis of Network Subsystem on Virtual Desktop Infrastructure System utilizing SR-IOV NIC

Microsoft Exchange Server 2003 Deployment Considerations

Enabling High performance Big Data platform with RDMA

Building a Scalable Storage with InfiniBand

Effects of Interrupt Coalescence on Network Measurements

The Elements of GigE Vision

New!! - Higher performance for Windows and UNIX environments

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

SMB Direct for SQL Server and Private Cloud

Packet-based Network Traffic Monitoring and Analysis with GPUs

Configuring Your Computer and Network Adapters for Best Performance

LS DYNA Performance Benchmarks and Profiling. January 2009

Initial End-to-End Performance Evaluation of 10-Gigabit Ethernet

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Computer Organization & Architecture Lecture #19

RoCE vs. iwarp Competitive Analysis

How To Improve Performance On A Linux Based Router

Transcription:

Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering Ohio State University Embedded IA Division Intel Corporation Austin, Texas

Advent of High Performance Networks Introduction and Motivation Ex: InfiniBand, 1-Gigabit Ethernet, Myrinet, etc. High Performance Protocols: VAPI / IBAL, GM Good to build new applications Not so beneficial for existing applications Built around portability: Should run on all platforms TCP/IP based sockets: A popular choice Several GENERIC optimizations proposed and implemented for TCP/IP Jacobson Optimization: Integrated Checksum-Copy [Jacob89] Header Prediction for Single Stream data transfer [Jacob89]: An analysis of TCP Processing Overhead, D. Clark, V. Jacobson, J. Romkey and H. Salwen. IEEE Communications

Generic Optimizations Insufficient! Network Speed Vs CPU 12 1 1GigE GHz and Gbps 8 6 4 Network CPU 2 1GigE 1971 1972 1974 1978 198 1982 1985 1989 1993 1995 1997 1999 2 23 Year http://www.intel.com/research/silicon/mooreslaw.htm Processor Speed DOES NOT scale with Network Speeds Protocol processing too expensive for current day systems

Network Specific Optimizations Sockets can utilize some network features Hardware support for protocol processing Interrupt Coalescing (can be considered generic) Checksum Offload (TCP stack has to modified) Insufficient! Network Specific Optimizations High Performance Sockets [shah99, balaji2] TCP Offload Engines (TOE) [shah99]: High Performance Sockets and RPC over Virtual Interface (VI) Architecture, H. Shah, C. Pu, R. S. Madukkarumukumana, In CANPC 99 [balaji2]: Impact of High Performance Sockets on Data Intensive Applications, P. Balaji, J. Wu, T. Kurc, U. Catalyurek, D. K. Panda, J. Saltz, In HPDC 3

Memory Traffic Bottleneck Offloaded Transport Layers provide some performance gains Protocol processing is offloaded; lesser host CPU overhead Better network performance for slower hosts Quite effective for 1-2 Gigabit networks Effective for faster (1-Gigabit) networks in some scenarios Memory Traffic Constraints Offloaded Transport Layers rely on the sockets interface Sockets API forces memory access operations in several scenarios Transactional protocols such as RPC, File I/O, etc. For 1-Gigabit networks memory access operations can limit network performance!

1-Gigabit Ethernet 1-Gigabit Networks Recently released as a successor in the Ethernet family Some adapters support TCP/IP checksum and Segmentation offload InfiniBand Open Industry Standard Interconnect for connecting compute and I/O nodes Provides High Performance Offloaded Transport Layer; Zero-Copy data-transfer Provides one-sided communication (RDMA, Remote Atomics) Becoming increasingly popular An example RDMA capable 1-Gigabit network

Objective New standards proposed for RDMA over IP Utilizes an offloaded TCP/IP stack on the network adapter Supports additional logic for zero-copy data transfer to the application Compatible with existing Layer 3 and 4 switches What s the impact of an RDMA interface over TCP/IP? Implications on CPU Utilization Implications on Memory Traffic Is it beneficial? We analyze these issues using InfiniBand s RDMA capabilities!

Presentation Outline Introduction and Motivation TCP/IP Control Path and Memory Traffic 1-Gigabit network performance for TCP/IP 1-Gigabit network performance for RDMA Memory Traffic Analysis for 1-Gigabit networks Conclusions and Future Work

TCP/IP Control Path (Sender Side) Application Buffer write() Return to Application Checksum and Copy Socket Buffer Post TX Kick Driver DMA Post Descriptor Driver INTR on transmit success NIC Packet Leaves Checksum, Copy and DMA are the data touching portions in TCP/IP Offloaded protocol stacks avoid checksum at the host; copy and DMA are still present

TCP/IP Control Path (Receiver Side) Application Buffer read() Application gets data Copy Socket Buffer Wait for read() DMA Driver INTR on Arrival NIC Packet Arrives Data might need to be buffered on the receiver side Pick-and-Post techniques force a memory copy on the receiver side

Memory Bus Traffic for TCP L2 $ Application Buffer and Socket written buffers back fetched to memory to L2 $ Memory CPU Appln. Buffer FSB North Bridge Memory Bus Appln. Buffer Socket Buffer Data Copy Data DMA Socket Buffer Each network byte requires 4 bytes to be transferred on the Memory Bus (unidirectional traffic) I/O Bus NIC Assuming 7% memory efficiency, TCP can support at most 4-5Gbps bidirectional on 1Gbps (4MHz/64bit FSB)

Network to Memory Traffic Ratio Application Buffer Fits in Cache Application Buffer Doesn t fit in Cache Transmit (Worst Case) 1-4 2-4 Transmit (Best Case) 1 2-4 Receive (Worst Case) 2-4 4 Receive (Best Case) 2 4 This table shows the minimum memory traffic associated with network data In reality socket buffer cache misses, control messages and noise traffic may cause these to be higher Details of other cases present in the paper

Presentation Outline Introduction and Motivation TCP/IP Control Path and Memory Traffic 1-Gigabit network performance for TCP/IP 1-Gigabit network performance for RDMA Memory Traffic Analysis for 1-Gigabit networks Conclusions and Future Work

Experimental Test-bed (1-Gig Ethernet) Two Dell26 Xeon 2.4GHz 2-way SMP node 1GB main memory (333MHz, DDR) Intel E751 Chipset 32K L1, 512K L2, 4MHz/64bit FSB PCI-X 133MHz/64bit I/O bus Intel 1GbE/Pro 1-Gigabit Ethernet adapters 8 P4 2. GHz nodes (IBM xseries 35; 8673-12X) Intel Pro/1 MT Server Gig-E adapters 256K main memory

1-Gigabit Ethernet: Latency and Bandwidth 5 45 4 Latency vs Message Size (Socket Buffer Size = 64K; MTU = 1.5K; Checksum Offloaded; PCI Burst Size = 4K) 6 5 3 25 Throughput vs Message Size (Socket Buffer Size = 64K; MTU = 16K; Checksum Offloaded; PCI Burst Size = 4K) 12 1 Latency (usec) 35 3 25 2 15 1 5 256 512 768 124 128 146 Message Size (bytes) 4 3 2 1 Throughput (Mbps) 2 15 1 5 1 2 4 8 16 3264 128 256512 1K 2K4K Message Size (bytes) 8K 16K 32K64K 128K 256K 8 6 4 2 Recv CPU Send CPU Latency Recv CPU Send CPU Bandwidth TCP/IP achieves a latency of 37us (Win Server 23) 2us on Linux About 5% CPU utilization on both platforms Peak Throughput of about 25Mbps; 8-1% CPU Utilization Application buffer is always in Cache!!

TCP Stack Pareto Analysis (64 byte) Sender Receiver Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others Kernel, Kernel Libraries and TCP/IP contribute to the Offloadable TCP/IP stack

TCP Stack Pareto Analysis (16K byte) Sender Receiver Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others TCP and other protocol overhead takes up most of the CPU Offload is beneficial when buffers fit into cache

TCP Stack Pareto Analysis (16K byte) Sender Receiver Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others Kernel Sockets Driver TCP/IP 1Gig Drivers Kernel Libraries Sockets Libraries NDIS Drivers Others TCP and other protocol overhead takes up most of the CPU Offload is beneficial when buffers fit into cache

Throughput (Fan-in/Fan-out) Throughput (Mbps) 4 35 3 25 2 15 1 5 Fan-In SB = 128K; MTU=9K 14 12 1 8 6 4 2 Throughput (Mbps) 5 45 4 35 3 25 2 15 1 5 Fan-Out 12 1 8 6 4 2 1 2 4 8 Number of Clients 1 2 4 8 Number of Clients CPU Throughput CPU Throughput Peak throughput of 35Mbps for Fan-In and 42Mbps for Fan-out

Bi-Directional Throughput Bandwidth (Mbps) 45 4 35 3 25 2 15 1 5 Bi-Directional Throughput 2 4 8 Number of Nodes 16 14 12 1 8 6 4 2 CPU CPU Throughput Not the traditional Bi-directional Bandwidth test Fan-in with half the nodes and Fan-out with the other half

Presentation Outline Introduction and Motivation TCP/IP Control Path and Memory Traffic 1-Gigabit network performance for TCP/IP 1-Gigabit network performance for RDMA Memory Traffic Analysis for 1-Gigabit networks Conclusions and Future Work

Experimental Test-bed (InfiniBand) 8 SuperMicro SUPER P4DL6 nodes Xeon 2.4GHz 2-way SMP nodes 512MB main memory (DDR) PCI-X 133MHZ/64bit I/O bus Mellanox InfiniHost MT2318 DualPort 4x HCA InfiniHost SDK version.2. HCA firmware version 1.17 Mellanox InfiniScale MT43132 8-port switch (4x) Linux kernel version 2.4.7-1smp

InfiniBand RDMA: Latency and Bandwidth Latency (us) 35 3 25 2 15 Latency 1 9 8 7 6 5 4 CPU Bandwidth (Mbps) 8 7 6 5 4 3 2 Bandwidth 45 4 35 3 25 2 15 1 CPU 1 3 1 5 5 2 1 1 4 16 64 256 1K 4K 16K 64K 1 2 4 8 16 32 64 128256 512 1K 2K 4K Message Size (bytes) Message Size (bytes) RW Send CPU RW Recv CPU RR Send CPU RR Recv CPU RW CPU RR CPU RW RR RW RR Performance improvement due to hardware support and zero-copy data transfer Near zero CPU Utilization at the data sink for large messages Performance limited by PCI-X I/O bus

Presentation Outline Introduction and Motivation TCP/IP Control Path and Memory Traffic 1-Gigabit network performance for TCP/IP 1-Gigabit network performance for RDMA Memory Traffic Analysis for 1-Gigabit networks Conclusions and Future Work

Throughput test: Memory Traffic Mem/Network Ratio 4.5 4 3.5 3 2.5 2 1.5 1.5 64 1K Sender Memory Traffic 16K 64K 256K 1M 4M Message Size (bytes) Sockets RDMA Mem/Network Ratio Receiver Memory Traffic 5 4.5 4 3.5 3 2.5 2 1.5 1.5 64 1K 16K 64K 256K Message Size (bytes) 1M 4M Sockets RDMA Sockets can force up to 4 times more memory traffic compared to the network traffic RDMA allows has a ratio of 1!!

Multi-Stream Tests: Memory Traffic 42624 3499.2 Bandwidth (Mbps) 25574.4 1749.6 8524.8 Fan-In Fan-Out Bi-Dir Network BW Memory BW Sustain Memory BW (65%) Memory Traffic is significantly higher than the network traffic Comes to within 5% of the practically attainable peak memory bandwidth

Presentation Outline Introduction and Motivation TCP/IP Control Path and Memory Traffic 1-Gigabit network performance for TCP/IP 1-Gigabit network performance for RDMA Memory Traffic Analysis for 1-Gigabit networks Conclusions and Future Work

Conclusions TCP/IP performance on High Performance Networks High Performance Sockets TCP Offload Engines 1-Gigabit Networks A new dimension of complexity Memory Traffic Sockets API can require significant memory traffic Up to 4 times more than the network traffic Allows saturation on less than 35% of the network bandwidth Shows potential benefits of providing RDMA over IP Significant benefits in performance, CPU and memory traffic

Future Work Memory Traffic Analysis for 64-bit systems Potential of the L3-Cache available in some systems Evaluation of various applications Transactional (SpecWeb) Streaming (Multimedia Services)

Thank You! For more information, please visit the NBC Home Page http://nowlab.cis.ohio-state.edu Network Based Computing Laboratory, The Ohio State University

Backup Slides