A way towards Lower Latency and Jitter

A way towards Lower Latency and Jitter
Jesse Brandeburg <jesse.brandeburg@intel.com>
Intel Ethernet

BIO
Jesse Brandeburg <jesse.brandeburg@intel.com> is a senior Linux developer in the Intel LAN Access Division, which produces the Intel Ethernet product lines. He has been with Intel since 1994 and has worked on the Linux e100, e1000, e1000e, igb, ixgb, and ixgbe drivers since 2002. Jesse splits his time between solving customer issues, performance tuning Intel's drivers, and bleeding-edge development for the Linux networking stack.

Acknowledgements
Contributors: Anil Vasudevan, Eric Geisler, Mike Polehn, Jason Neighbors, Alexander Duyck, Arun Ilango, Yadong Li, Eliezer Tamir

"The speed of light sucks." - John Carmack

Current State
- NAPI is pretty good, but it is optimized for throughput.
- Certain customers want extremely low end-to-end latency:
  - Cloud providers
  - High Performance Computing (HPC)
  - Financial Services Industry (FSI)
- The race to the lowest latency has sparked user-space stacks; most bypass the kernel stack.
- Examples include OpenOnload application acceleration, Mellanox Messaging Accelerator (VMA), RoCEE/IBoE, RDMA/iWarp, and others [1].

[1] See notes for links to the above products.

Problem Statement
- Latency is high by default (especially for Ethernet).
- Jitter is unpredictable by default.
Software causes:
- Scheduling/context switching of the process
- Interrupt balancing algorithms
- Interrupt rate settings
- Path length from receive to transmit
Hardware causes:
- Number of fetches from memory
- Latency inside the network controller
- Interrupt propagation
- Power management (NIC, PCIe, CPU)

Latency and Jitter Contributors
Key sources today -> Solutions:
- Raw Hardware Latency -> New hardware
- Software Execution Latency -> Opportunity
- Scheduling / Context Switching -> Opportunity
- Interrupt Rebalancing -> Interrupt-to-core mapping
- Interrupt Moderation/Limiting -> Minimize/disable throttling (ITR=0; see the sketch below)
- Power Management -> Disable (or limit) CPU power management and PCIe power management
- Bus Utilization (jitter) -> Isolate the device

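For the interrupt moderation item above, here is a minimal user-space sketch of the "ITR=0" idea, under the assumption that the interface is named eth0: it reads the current coalescing settings through the standard ethtool ioctl and writes rx-usecs back as zero, the same effect as running `ethtool -C eth0 rx-usecs 0`.

    /* Hedged sketch of the "ITR=0" / minimize-throttling item above: disable
     * receive interrupt coalescing from user space via the standard ethtool
     * ioctl. The interface name "eth0" is an assumption. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* interface is an assumption */
        ifr.ifr_data = (void *)&ec;

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {       /* read current settings */
            perror("ETHTOOL_GCOALESCE");
            return 1;
        }

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 0;                     /* no rx interrupt throttling */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {       /* write the settings back */
            perror("ETHTOOL_SCOALESCE");
            return 1;
        }

        close(fd);
        return 0;
    }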

Traditional Transaction Flow
1. App transmits through the sockets API; the data is passed down to the driver and hardware unblocked. TX is fire-and-forget.
2. App checks for receive data.
3. No data is immediately available, so the app blocks.
4. A packet is received and an interrupt is generated (subject to the interrupt rate and interrupt balancing).
5. The driver passes the packet to the protocol layer.
6. The protocol/sockets layer wakes the app.
7. The app receives the data through the sockets API.
8. Repeat.
[Diagram: the transaction path through the sockets, protocols, device driver, and NIC layers, annotated with the step numbers above.]
Very inefficient for low-latency traffic.
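
For reference, a minimal sketch of the application side of this flow, assuming a plain UDP request/response loop against an illustrative peer address and port: the blocking recv() is where steps 3 through 6 happen (sleep, packet arrival, interrupt, wakeup).

    /* Minimal sketch of the application side of the traditional flow above.
     * The peer address and the echo port are illustrative assumptions. */
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer;

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7);                   /* echo service, illustrative */
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        char buf[64] = "ping";
        for (int i = 0; i < 1000; i++) {
            send(fd, buf, sizeof(buf), 0);          /* step 1: fire-and-forget TX */
            recv(fd, buf, sizeof(buf), 0);          /* steps 2-7: block, interrupt, wakeup */
        }
        close(fd);
        return 0;
    }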

Latency Breakdown (kernel 2.6.36)

Latency Breakdown (kernel v3.5)
[Chart: v3.5 round-trip packet timings, total 5722 ns, broken down across Tx Driver, Tx Protocol, Tx Socket, Application, Rx Socket, Rx Protocol, and Rx Driver; the per-stage values shown are 480, 360, 529, 356, 3064, 243, and 692 ns.]

Jitter Measurements
[Chart: Min_latency and Max_latency in microseconds (log scale), as measured by netperf, for the udp_rr and tcp_rr tests (arx-off-1, 3.9.15).]

Jitter Measurements
[Chart: standard deviation of latency (Stddev_latency, log scale), as measured by netperf, for the udp_rr and tcp_rr tests (arx-off-1, 3.9.15).]

Proposed Solution
- Improve the software latency and jitter by driving the receive from user context.
Result:
- The Low Latency Sockets proof of concept

Low Latency Sockets (LLS)
- LLS is a software initiative to reduce networking latency and jitter within the kernel.
- The native protocol stack is enhanced with a low-latency path, in conjunction with packet classification (queue picking) by the NIC.
- It is transparent to applications and benefits those sensitive to unpredictable latency.
- Top-down busy-wait polling replaces interrupts for incoming packets (a sketch of what such a driver hook might look like follows).
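
To make "busy-wait polling replaces interrupts" concrete, here is a hedged sketch of the general shape a driver-level poll hook could take. It is not the prototype's actual code; struct lls_rx_ring, lls_clean_rx(), and the hook name are assumptions used only for illustration.

    /* Hedged sketch only: roughly what a per-queue busy-poll hook in a driver
     * might look like. The names are invented; the prototype's code differs. */
    #include <linux/netdevice.h>
    #include <linux/spinlock.h>

    struct lls_rx_ring {
        struct napi_struct napi;
        spinlock_t poll_lock;        /* serializes interrupt vs. busy-poll cleanup */
        /* descriptor ring, buffers, ... */
    };

    /* Assumed helper: the driver's normal NAPI receive cleanup, limited to
     * 'budget' packets, pushing anything found straight up the stack. */
    int lls_clean_rx(struct lls_rx_ring *ring, int budget);

    static int lls_ndo_busy_poll(struct napi_struct *napi)
    {
        struct lls_rx_ring *ring = container_of(napi, struct lls_rx_ring, napi);
        int cleaned = 0;

        /* Called from process (syscall) context; back off instead of fighting
         * the interrupt-driven NAPI path for the same ring. */
        if (spin_trylock(&ring->poll_lock)) {
            cleaned = lls_clean_rx(ring, 4);
            spin_unlock(&ring->poll_lock);
        }
        return cleaned;              /* packets delivered without an interrupt */
    }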

New Low-Latency Transaction Flow
1. App transmits through the sockets API; the data is passed down to the driver and hardware unblocked. TX is fire-and-forget.
2. App checks for data (receive).
3. The socket layer checks the device driver for a pending packet (the poll starts; see the sketch below).
4. Meanwhile, a packet is received by the NIC.
5. The driver processes the pending packet, bypassing the context switch and interrupt.
6. The driver passes the packet to the protocol layer.
7. The app receives the data through the sockets API.
8. Repeat.
[Diagram: the same path through the sockets, protocols, device driver, and NIC layers, annotated with the step numbers above.]
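
The receive side of steps 3 through 5 can be pictured as a bounded busy-wait in the socket layer. The sketch below is an illustration under assumptions, not the prototype's code: lls_poll_driver_queue() and the 50 microsecond budget are invented.

    /* Hedged sketch only: a bounded busy-wait in the socket receive path.
     * lls_poll_driver_queue() and the 50 us budget are assumptions. */
    #include <linux/ktime.h>
    #include <linux/skbuff.h>
    #include <net/sock.h>

    /* Assumed helper: find the device queue this socket's flow is steered to
     * and invoke the driver's busy-poll hook for it (see the driver sketch). */
    int lls_poll_driver_queue(struct sock *sk);

    static bool lls_busy_wait(struct sock *sk)
    {
        ktime_t deadline = ktime_add_us(ktime_get(), 50);   /* bounded spin */

        while (skb_queue_empty(&sk->sk_receive_queue)) {
            if (ktime_compare(ktime_get(), deadline) > 0)
                return false;             /* give up, fall back to blocking */
            lls_poll_driver_queue(sk);    /* pull packets in from process context */
            cpu_relax();
        }
        return true;                      /* data arrived without an interrupt wakeup */
    }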

Proof of Concept
- Code developed on the 2.6.36.2 kernel.
- Initial numbers were taken with the out-of-tree ixgbe driver.
- Includes lots of timing and debug code.
- Currently reliant upon hardware flow steering: one queue pair (Tx/Rx) per CPU, with interrupt affinity configured (see the sketch below).
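
A hedged sketch of the "interrupt affinity configured" step, under assumptions: IRQ 73 and CPU 2 are placeholders for whatever /proc/interrupts shows for the queue paired with the application's core.

    /* Hedged sketch: pin the process to one core and steer one queue's
     * interrupt to the same core. IRQ 73 and CPU 2 are placeholders. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int irq = 73, cpu = 2;          /* placeholders, read /proc/interrupts */
        char path[64];
        cpu_set_t set;
        FILE *f;

        CPU_ZERO(&set);                 /* pin this process to the chosen core */
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");           /* steer the queue's interrupt there too */
        if (!f) {
            perror(path);
            return 1;
        }
        fprintf(f, "%x\n", 1u << cpu);
        fclose(f);
        return 0;
    }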

Proof of Concept Results (2.6.36.2)

Jitter Results
[Chart: minimum and maximum latency in microseconds (log scale), as measured by netperf, comparing udp_rr and tcp_rr for the baseline (arx-off-1, 3.9.15) against the LLS prototype (arx-off-90, lls-r03).]

Jitter Results
[Chart: standard deviation of latency (Stddev_latency, log scale), as measured by netperf, comparing udp_rr and tcp_rr for the baseline (arx-off-1, 3.9.15) against the LLS prototype (arx-off-90, lls-r03).]

Possible Issues
- Unpalatable structure modifications (struct sk_buff, struct sock; an illustrative sketch follows).
- Dependency on driver- or kernel-implemented flow steering.
- The current amount of driver code required to implement it; work is already in progress on a much simpler version.
- Should it be enabled by default? How can we turn this on and off? We don't want a socket option, as that defeats the purpose.
- Security issues? An application can now force hardware/memory reads, but this is unlikely to be an issue: the new poll runs in syscall context, which should be safe. We need to be careful not to create a new vulnerability. Does this new implementation create other problems?
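
To illustrate the "unpalatable structure modifications" point: both the packet metadata and the socket need enough new state to find the right device queue from process context. The structs and field names below are invented for illustration and are not the prototype's actual fields or patches.

    /* Hedged illustration only: the kind of extra state involved. These are
     * stand-alone structs with invented names, not patches to the real
     * struct sk_buff / struct sock. */
    struct lls_skb_state {
        unsigned int queue_id;    /* which hardware receive queue delivered the skb */
    };

    struct lls_sock_state {
        unsigned int queue_id;    /* copied from the most recent skb on receive;
                                   * consulted by the busy-poll path on the next read */
    };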

Current work
Work in progress includes:
- A further simplified driver using a polling thread
- A port of the current code to v3.5
Future work:
- Post the current v3.5 code to netdev (Q4 2012)
- Design and refactor based on comments
- Make sure the new flow is measurable and debuggable

Code
Git tree posted at: https://github.com/jbrandeb/lls.git
Branches:
- v2.6.36.2_lls: the original 2.6.36.2-based prototype
- v3.5.1_lls: a port of the code to v3.5.1 stable (all features may not work yet)

Contact
jesse.brandeburg@intel.com
e1000-devel@lists.sourceforge.net
netdev@vger.kernel.org

Summary
- Customers want a low-latency and low-jitter solution.
- We can make one native to the kernel.
- The LLS prototype shows a possible way forward and achieved lower latency and jitter.
Discussion:
- What would you do differently?
- Do you want to help?

Backup

Abstract
This talk describes development-in-progress of a new in-kernel interface that allows applications to achieve lower network latency and jitter. It creates a new driver interface that lets an application drive a poll through the socket layer all the way down to the device driver. The benefits are that applications do not have to change, the Linux networking stack is not bypassed in any way, the latency of data to the application is minimized, and jitter becomes much more predictable. The design, implementation, and results from an early prototype will be shown, and current efforts to refine, refactor, and upstream the design will be discussed. Affected areas include the core networking stack and network drivers.