XGE 10Gb Ethernet Performance VxWorks 6.x, Linux 2.6.x

CRITICAL I/O WHITE PAPER

XGE 10Gb Ethernet Performance VxWorks 6.x, Linux 2.6.x

Abstract

The XGE 10G series of 10 GbE NICs provides dual port 10 Gigabit Ethernet (10GbE) connectivity for embedded systems with the high performance characteristics that are essential for data intensive real-time systems, while maintaining 100% interoperability and compatibility with standard Ethernet infrastructures. The drivers support three usage models: standard networking APIs, a high performance UDP Streaming API, and a low latency RDMA API. This paper provides actual measured data transfer rates and CPU loading for the XGE 10G running under VxWorks 6.x and Linux 2.6.x for a variety of modes of operation, on a variety of platforms.

Revision 4/1/2014

XGE 10Gb Ethernet Performance VxWorks 6.x, Linux 2.6.x, 3.x

The XGE 10G series of 10 GbE NICs provides dual port 10 Gigabit Ethernet (10GbE) connectivity to embedded systems with the high performance characteristics that are essential for data intensive real-time systems, while maintaining 100% interoperability and compatibility with standard Ethernet infrastructures. The XGE 10G provides a balanced architecture that combines hardware acceleration for bulk data transfers with the flexibility of a programmable protocol processor to significantly improve network performance. It reduces the CPU cycles and processing burden required for 10 GbE networking, maximizing I/O bandwidth without sacrificing host CPU efficiency. This paper summarizes the data transfer rates and CPU loading of the XGE 10G running under VxWorks 6.x and Linux 2.6.x for a variety of modes of operation, on a variety of platforms. TCP and UDP performance of more than 850 MB/s under VxWorks and 1200 MB/s under Linux has been measured.

XGE 10G Driver Models

The Critical I/O XGE 10G VxWorks and Linux drivers allow user access to the 10GbE network interface through three different methods: the standard VxWorks or Linux sockets API, a high performance UDP Streaming API, and a low latency RDMA API. The standard sockets API accesses the NIC through the VxWorks/Linux network stack, similar to a normal NIC driver, while the UDP Streaming and RDMA models completely bypass the VxWorks/Linux sockets layer, instead using specialized APIs that directly access the XGE 10G hardware.

Standard Sockets API Model

The standard sockets API model connects the XGE 10G driver through the VxWorks or Linux sockets interface. This allows new and existing user developed socket applications, as well as standard network applications such as FTP, Telnet, and NFS, to make use of the XGE interface. Network performance and CPU loading are excellent, but rates are limited somewhat by the interaction of the XGE 10G hardware with the VxWorks or Linux O/S. For VxWorks, two modes of operation are available using the standard sockets model. The Moderated mode provides very good network performance combined with the lowest CPU loading through the use of interrupt moderation and coalescing techniques. The tradeoff is slightly lower peak performance and slightly higher transfer latencies. The Non-Moderated mode focuses on achieving the highest possible data rates and the lowest possible transfer latencies, but at the expense of higher CPU loading.
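As a point of reference for the sockets results that follow, throughput through the standard sockets path is typically measured with a simple timed send loop. The minimal sketch below uses plain BSD sockets as available under both VxWorks and Linux; the peer address, port, and block size are placeholders, and this is an illustrative measurement pattern, not Critical I/O's actual benchmark code.

/* Minimal TCP send-throughput sketch using standard BSD sockets.
 * Peer address, port, and block size are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define BLOCK_SIZE  (256 * 1024)   /* send block size, varied in the charts */
#define NUM_BLOCKS  4096           /* total data = BLOCK_SIZE * NUM_BLOCKS  */

int main(void)
{
    struct sockaddr_in peer;
    char *buf = malloc(BLOCK_SIZE);
    struct timespec t0, t1;
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr); /* placeholder peer */

    if (sock < 0 || !buf || connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("setup");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_BLOCKS; i++) {
        ssize_t off = 0;
        while (off < BLOCK_SIZE) {                     /* send() may be partial */
            ssize_t n = send(sock, buf + off, BLOCK_SIZE - off, 0);
            if (n <= 0) { perror("send"); return 1; }
            off += n;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = (double)BLOCK_SIZE * NUM_BLOCKS / 1e6;
    printf("%.1f MB in %.2f s = %.1f MB/s\n", mb, secs, mb / secs);

    close(sock);
    free(buf);
    return 0;
}

The block size passed to each send() call is the quantity varied along the horizontal axis of the charts below; larger blocks amortize per-call stack overhead and generally raise throughput while lowering CPU loading.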

Streaming UDP API Model

UDP Streaming provides a high performance data transfer model that leverages the offload capabilities of the XGE 10G hardware. Because the standard BSD socket datagram send/receive API is very limited, access to the UDP streaming functionality is provided through a specialized UDP streaming send/receive API. This specialized API provides very high performance UDP sends and receives with low host CPU loading. For sends, the UDP streaming interface is used to send application supplied blocks of data as a stream of UDP datagrams, with the datagram size being a user specified value. Datagrams must be sized to fit within the current Ethernet frame size. The XGE offload hardware breaks the application supplied blocks of data into a sequence of UDP datagrams, which relieves the host processor of the overhead of performing multiple individual datagram sends. Thus the application may pass very large blocks of data to the UDP streaming API to be sent, with no CPU involvement needed to perform the datagram sends other than the initial send setup. For receives, the application provides a large data buffer to the UDP streaming API that is to be filled with received datagrams for a defined IP/port. The offload hardware fills the buffer with received datagrams. The application may pass very large receive buffers to the UDP streaming API to be filled, with negligible CPU involvement required to fill the buffers with data after the initial receive setup. The data stream is delivered to the application as a series of UDP datagrams that are written directly into application data buffers after the datagram header information has been stripped off.

RDMA API Model

Remote Direct Memory Access (RDMA) is a capability that allows processor to processor data movement directly between application buffers with minimal CPU involvement. The RDMA API provides functionality to define local RDMA buffer regions which remote RDMA nodes can read and write without local CPU involvement, either directly or using RDMA based messaging.

Performance Measurements

VxWorks Performance Results -- Standard Sockets API

The following four charts show representative performance for TCP sends and receives, accessing the XGE 10G through the standard VxWorks sockets API. The charts show data rate in green (squares, upper lines) and CPU loading in blue (diamonds, lower lines) for a variety of send or receive block sizes. This data was taken using an embedded class processor base-board hosting an XGE 10G XMC. The processor board used a dual-core Intel i7 processor running VxWorks 6.8 SMP. Standard 9K jumbo frames were used.

The first two charts show results for TCP sends and receives using the Moderated (interrupt coalesced) mode, while the last two charts show results for TCP sends and receives using the Non-Moderated (maximum performance) mode.

[Chart: TCP Send Performance (Intel i7, VxWorks 6.8, Jumbo Frames), Moderated Mode -- send rate and CPU loading vs. send block size]

[Chart: TCP Receive Performance (Intel i7, VxWorks 6.8, Jumbo Frames), Moderated Mode -- receive rate and CPU loading vs. receive block size]

[Chart: TCP Send Performance (Intel i7, VxWorks 6.8, Jumbo Frames), Non-Moderated Mode -- send rate and CPU loading vs. send block size]

[Chart: TCP Receive Performance (Intel i7, VxWorks 6.8, Jumbo Frames), Non-Moderated Mode -- receive rate and CPU loading vs. receive block size]

VxWorks Performance Results -- Streaming UDP API

The following two charts show representative performance for large block UDP sends and receives, accessing the XGE 10G through the Streaming UDP API. The charts show data rate in green (squares, upper lines) and CPU loading in blue (diamonds, lower lines) for a variety of send or receive block sizes. This data was taken using an embedded class processor base-board hosting an XGE 10G XMC. The processor board used a PowerPC processor running VxWorks 6.5. Standard 9K jumbo frames were used.

[Chart: UDP Streaming Sends (PPC, VxWorks 6.5, Jumbo Frames) -- send rate and CPU loading vs. send block size, 64K to 4M]

[Chart: UDP Streaming Receives (PPC, VxWorks 6.5, Jumbo Frames) -- receive rate and CPU loading vs. receive block size, 64K to 4M]
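For comparison, without the streaming offload a sockets application must issue one system call per datagram to move a large block of data. The minimal sketch below (standard BSD sockets, with placeholder block size, datagram size, and peer address) shows the per-datagram send loop that the UDP Streaming API replaces with a single large-block submission handled by the XGE hardware.

/* Per-datagram send loop that the UDP Streaming API offloads to hardware.
 * Block size, datagram size, and peer address are placeholders. */
#include <string.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define BLOCK_SIZE     (1024 * 1024)  /* application block to transmit        */
#define DATAGRAM_SIZE  8192           /* must fit the Ethernet (jumbo) frame  */

int main(void)
{
    char *block = malloc(BLOCK_SIZE);
    struct sockaddr_in peer;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(block, 0, BLOCK_SIZE);
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr); /* placeholder peer */

    /* The host CPU pays one sendto() call, and one trip through the network
     * stack, per datagram; with the streaming API the whole block is handed
     * to the offload hardware in a single operation instead. */
    for (size_t off = 0; off < BLOCK_SIZE; off += DATAGRAM_SIZE) {
        size_t len = BLOCK_SIZE - off;
        if (len > DATAGRAM_SIZE)
            len = DATAGRAM_SIZE;
        sendto(sock, block + off, len, 0,
               (struct sockaddr *)&peer, sizeof(peer));
    }

    close(sock);
    free(block);
    return 0;
}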

Linux Performance Results -- Standard Sockets API

The following two charts show representative performance for TCP sends and receives, accessing the XGE 10G through the standard Linux sockets API. The charts show data rate in green (squares, upper lines) and CPU loading in blue (diamonds, lower lines) for a variety of send or receive block sizes. This data was taken using an embedded class processor base-board hosting an XGE 10G XMC. The processor board used a quad-core Intel Core2 processor running Linux 2.6.18. Standard 9K jumbo frames were used.

[Chart: TCP Send Performance (Intel Core2 quad, Linux 2.6.18, Jumbo Frames) -- send rate and CPU loading vs. send block size]

[Chart: TCP Receive Performance (Intel Core2 quad, Linux 2.6.18, Jumbo Frames) -- receive rate and CPU loading vs. receive block size]

Linux Performance Results -- Streaming UDP API

The following two charts show representative performance for large block UDP sends and receives, accessing the XGE 10G through the Streaming UDP API. The charts show data rate in green (squares, upper lines) and CPU loading in blue (diamonds, lower lines) for a variety of send or receive block sizes. This data was taken using an embedded class processor base-board hosting an XGE 10G XMC. The processor board used a quad-core Intel Core2 processor running Linux 2.6.18. Standard 9K jumbo frames were used.

[Chart: UDP Streaming Sends (Intel Core2 quad, Linux 2.6.18, Jumbo Frames) -- send rate and CPU loading vs. send block size, 64K to 4M]

[Chart: UDP Streaming Receives (Intel Core2 quad, Linux 2.6.18, Jumbo Frames) -- receive rate and CPU loading vs. receive block size, 64K to 4M]
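The receive side is analogous: with plain sockets the host must issue one receive call per datagram and place each payload into the application buffer itself, whereas the streaming receive API lets the offload hardware fill a large application buffer directly. A minimal sketch of the conventional loop is shown below; the port and sizes are placeholders, and this is an illustration of the overhead being avoided, not the vendor's API.

/* Per-datagram receive loop that the UDP Streaming API replaces with a
 * hardware-filled application buffer. Port and sizes are placeholders. */
#include <string.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define BUFFER_SIZE    (1024 * 1024)  /* large application receive buffer */
#define DATAGRAM_SIZE  8192           /* expected datagram payload size   */

int main(void)
{
    char *buffer = malloc(BUFFER_SIZE);
    struct sockaddr_in local;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    size_t filled = 0;

    memset(&local, 0, sizeof(local));
    local.sin_family      = AF_INET;
    local.sin_port        = htons(5001);        /* placeholder port */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(sock, (struct sockaddr *)&local, sizeof(local));

    /* One recv() call per datagram, each payload appended into the large
     * buffer by the host CPU; the streaming receive API instead has the
     * offload hardware deposit the payloads into the buffer directly. */
    while (filled + DATAGRAM_SIZE <= BUFFER_SIZE) {
        ssize_t n = recv(sock, buffer + filled, BUFFER_SIZE - filled, 0);
        if (n <= 0)
            break;
        filled += n;
    }

    close(sock);
    free(buffer);
    return 0;
}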

Performance Results -- Linux RDMA API

The two charts below show the bandwidth, latency, and CPU utilization using the RDMA over Converged Enhanced Ethernet (RoCEE) protocol. Testing was performed between a host processor board with an Intel Core2 Quad (2.82 GHz) processor (client node) and another host processor board with an Intel Xeon (2.4 GHz) processor (server node). A Fulcrum/Intel Monaco Data Center Bridging (DCB) switch was placed between the nodes, and the link operated at 10G. Both nodes were running the CentOS 6.2 (Linux 2.6.32) operating system.

Figure 1 shows the round-trip time (RTT) latency and the corresponding client node CPU utilization. The RTT value is the time from sending a message of size n to receiving a response message of the same size, divided by 2 to account for the round trip. It is important to note that the driver provides two methods of receiving send and receive completions. One method is to continuously poll a routine that checks for completions; this method is very CPU intensive but can provide the lowest latency. The other method, which was used for this testing, is to call a blocking function that returns when a completion is received from the hardware. This second method provides slightly higher latency but significantly lower CPU loading. Latency is measured from send start to receive complete, so for larger transfer sizes the latency is dominated by the actual transfer time.

[Chart: RTT Latency (Intel Core2, Linux 2.6.32) -- latency (us) and CPU utilization (%) vs. transfer size (bytes)]

Figure 1: RDMA RTT Latency and CPU utilization

Figure 2 shows the RDMA write bandwidth and CPU utilization. Bandwidth reaches the 10GE link maximum at transfer sizes of about 4KB. The CPU utilization starts out around 33% for 2 byte transfers and declines to about 2% for 64KB transfers.

[Chart: RDMA Write Performance (Intel Core2, Linux 2.6.32) -- rate (MB/s) and CPU utilization (%) vs. transfer size (bytes)]

Figure 2: RDMA Write Bandwidth and CPU utilization.
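The latency numbers in Figure 1 follow the half-round-trip methodology described above: time a same-size request/response exchange and divide by two. As a purely illustrative sketch of that timing arithmetic, the loop below uses plain UDP sockets with a placeholder peer and message size; the results in this paper were measured over the RDMA API, not over this sockets path.

/* Illustrative half-round-trip latency measurement over plain UDP sockets.
 * Peer address, port, and message size are placeholders; the remote side
 * is assumed to echo each datagram back. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define MSG_SIZE  256     /* message size n (placeholder)         */
#define ITERS     1000    /* number of request/response exchanges */

int main(void)
{
    char buf[MSG_SIZE];
    struct sockaddr_in peer;
    struct timespec t0, t1;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(buf, 0, sizeof(buf));
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5002);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr); /* placeholder peer */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        /* Send a message of size n and wait for the echoed response of the
         * same size. */
        sendto(sock, buf, MSG_SIZE, 0, (struct sockaddr *)&peer, sizeof(peer));
        recv(sock, buf, MSG_SIZE, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
    /* One exchange is a full round trip; divide by 2 for the one-way
     * "RTT latency" value plotted in Figure 1. */
    printf("avg latency: %.2f us\n", total_us / ITERS / 2.0);

    close(sock);
    return 0;
}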