Highly parallel, lock-less, user-space TCP/IP networking stack based on FreeBSD. EuroBSDCon 2013, Malta

Networking stack. Requirements: high throughput, low latency, connection establishments and teardowns per second. Solution: zero-copy operation, lock elimination.

Hardware platform overview: Tilera TILEncore-Gx card, TILE-Gx36 processor, 36 tiles (cores), 4x SFP+ 10GbE ports.

Multicore architecture overview

Multicore architecture overview: local L1 and L2 caches; distributed L3 cache (by using the L2 caches of other tiles); memory homing: local homing, remote homing, hash for homing.

mPIPE (multicore Programmable Intelligent Packet Engine): packet header parsing, packet distribution, packet buffer management, load balancing, calculating the L4 checksum on ingress and egress traffic, gathering packet data potentially scattered across multiple buffers from tiles.

Software platform overview: Zero Overhead Linux (ZOL); one thread assigned to one tile; no interrupts, context switches or syscalls. Modified libpthreads: standard API, no significant code change necessary except explicit affinity settings. Library for direct access to mPIPE buffers from user-space, including buffer allocation routines.
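As a rough illustration of the explicit affinity settings mentioned above, the sketch below pins one worker thread per tile using the standard GNU pthread affinity extension. The tile_main() and spawn_pinned_worker() names are illustrative, not part of the project.

```c
/* Minimal sketch, assuming a Linux/ZOL environment with the GNU
 * pthread_attr_setaffinity_np() extension. tile_main() is a hypothetical
 * per-tile worker function. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *tile_main(void *arg)
{
    long tile = (long)arg;
    printf("worker running on tile %ld\n", tile);
    /* ... per-tile run-to-completion loop would go here ... */
    return NULL;
}

int spawn_pinned_worker(long tile)
{
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t cpus;

    CPU_ZERO(&cpus);
    CPU_SET((int)tile, &cpus);          /* one tile (core) per thread */

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
    return pthread_create(&t, &attr, tile_main, (void *)tile);
}
```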

Design approach: one process composed of a number of threads running on separate tiles. Each tile executes the same single-threaded net-stack code. Each tile performs the entire processing of incoming and outgoing packets for a given TCP/IP connection (in a run-to-completion fashion). A given TCP/IP connection is always processed by the same tile (static flow affinity) using the mPIPE flow hash functionality. This approach has the following advantages: avoidance of data structure locking inside the stack; smaller sets of PCBs local to each tile, which speeds up lookups, creation etc. as they are executed in parallel on different tiles; optimal use of the tiles' caches.

Functional partitioning (diagram): the RX/TX tile handles TCP processing, buffer alloc/dealloc and packet queue polling; the control channel, data channel and (de)alloc channel connect it to the APP tile, which performs raw/native API calls and application processing.

RX/TX tile: performs TCP processing (FreeBSD routines); polls for control messages; polls for data packets (both ingress from mPIPE and egress from the application); manages allocation/de-allocation queues; local timer tick.
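A hedged sketch of what such a run-to-completion RX/TX tile loop could look like. Every function and type named here (tile_ctx, mpipe_poll_rx, tcp_input_freebsd, and so on) is a placeholder for the real mPIPE and FreeBSD-derived routines, not the project's actual API.

```c
/* Sketch only: one RX/TX tile polling its queues in a run-to-completion loop. */
static void rxtx_tile_loop(struct tile_ctx *ctx)
{
    for (;;) {
        struct pktbuf *p;

        /* Ingress: packets distributed to this tile by the mPIPE flow hash. */
        while ((p = mpipe_poll_rx(ctx)) != NULL)
            tcp_input_freebsd(ctx, p);       /* reused FreeBSD TCP routines */

        /* Egress: data queued by application tiles on the data channels. */
        while ((p = data_channel_poll_tx(ctx)) != NULL)
            tcp_output_freebsd(ctx, p);

        control_channel_poll(ctx);           /* connect/listen/... requests */
        refill_alloc_queues(ctx);            /* keep allocation queues full */
        drain_dealloc_queues(ctx);           /* return freed buffers to pools */

        tcp_timer_tick(ctx);                 /* local timer tick */
    }
}
```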

Inter-tile channels. Data communication channel: one for each TCP connection; ingress and egress queues; no locks at the stack endpoint; serves socket-buffer functionality for the stack. Control channel: one for each netstack tile; handles requests like connect, listen, etc. Packet buffer allocation/free channels: one for each netstack tile; described in greater detail on later slides.
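Because each channel queue has exactly one producer tile and one consumer tile, it can be implemented without locks, using only ordered loads and stores. The C11 single-producer/single-consumer ring below is a minimal sketch of that idea and an assumption on my part; the slides do not show how the real channels (built on the Tilera hardware facilities) are implemented.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024   /* must be a power of two */

struct spsc_ring {
    _Atomic size_t head;          /* written only by the consumer */
    _Atomic size_t tail;          /* written only by the producer */
    void *slot[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int spsc_push(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return -1;                             /* full */
    r->slot[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns NULL if the ring is empty. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    void *item;

    if (head == tail)
        return NULL;                           /* empty */
    item = r->slot[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}
```

The release store on the producer's tail paired with the acquire load on the consumer's side is what makes the slot contents visible without any lock.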

App tile (raw/native API): similar to socket calls: listen(), bind(), connect(), send(), receive(), etc. All calls are always non-blocking, based on polling. Provides additional routines for buffer manipulations, necessary for the zero-copy approach; includes buffer allocation, expanding, shrinking, etc.
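A hypothetical usage sketch of the non-blocking raw/native API from an application tile: every call returns immediately and the application keeps polling. All ns_* names and constants are made up for illustration; the slide only lists listen(), bind(), connect(), send() and receive() style calls.

```c
/* Sketch of a polling echo server on an application tile, assuming a
 * hypothetical ns_* API where every call is non-blocking. */
void echo_server(void)
{
    ns_handle_t srv = ns_socket();
    ns_bind(srv, NS_ANY_ADDR, 7);
    ns_listen(srv, 128);

    for (;;) {
        ns_handle_t conn = ns_accept(srv);         /* NS_INVALID if none pending */
        if (conn != NS_INVALID) {
            struct pktbuf *p = ns_receive(conn);   /* NULL if nothing queued */
            if (p != NULL)
                ns_send(conn, p);    /* echo the same zero-copy buffer back */
        }
        /* nothing blocks: loop back and poll again */
    }
}
```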

Socket-like API: inspired by lwIP; implemented with the raw/native API only; API compatibility only (a handle returned when creating a socket is not a regular descriptor); intended as a temporary, easier-to-use API for the user; lower performance than the raw/native API.

Ensuring zero-copy. Requirement: the same packet buffer is seen by the hardware, the networking stack and the application. Solution: dedicated memory pages accessible directly by mPIPE; buffer pools for each RX/TX tile (eliminates locks on the stack side); each buffer has to return to its original pool; allocation/deallocation can be done only by mPIPE or the RX/TX tile.

pktbuf aka mbuf. Each packet is represented as a pktbuf (mbuf); fixed-size buffer pools are managed by the hardware; API routines manipulate pktbufs of arbitrary size consisting of chains of fixed-size buffers. Two unidirectional queues: an allocation queue from the RX/TX tile to the application tile, and a de-allocation queue from the application tile to the RX/TX tile. RX/TX tile role: keeping the allocation queue full and the de-allocation queue free.
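A sketch of what a chained pktbuf might look like; the field names are illustrative, not the stack's actual definitions.

```c
#include <stdint.h>

/* Rough shape of a pktbuf chain: fixed-size buffers owned by an mPIPE pool,
 * linked together to represent a packet of arbitrary size. */
struct pktbuf {
    struct pktbuf *next;     /* next fixed-size buffer in this packet, or NULL */
    uint8_t       *data;     /* start of valid data within the buffer */
    uint32_t       len;      /* valid bytes in this buffer */
    uint16_t       pool_id;  /* pool (RX/TX tile) the buffer must return to */
    /* ... hardware descriptor fields ... */
};

/* Total payload length of a chained pktbuf. */
static uint32_t pktbuf_chain_len(const struct pktbuf *p)
{
    uint32_t total = 0;
    for (; p != NULL; p = p->next)
        total += p->len;
    return total;
}
```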

Allocation/de-allocation (diagram): the RX/TX tile puts new or reused buffers into the allocation queue; the app tile requests new packet buffers from it and frees packet buffers back through the de-allocation queue.
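A sketch of how the two unidirectional buffer queues could be serviced, reusing the spsc_* queue sketch above; app_chan, pool_get() and pool_put() are hypothetical helpers. The application tile only pops from the allocation queue and pushes to the de-allocation queue, while the RX/TX tile keeps the former full and the latter drained, so neither side ever needs a lock.

```c
/* Application side: obtain a buffer and later return it. */
struct pktbuf *app_alloc_pktbuf(struct app_chan *ch)
{
    return spsc_pop(&ch->alloc_q);          /* NULL if the queue ran dry */
}

void app_free_pktbuf(struct app_chan *ch, struct pktbuf *p)
{
    spsc_push(&ch->dealloc_q, p);           /* will return to its home pool */
}

/* RX/TX side, called from the polling loop. */
void rxtx_service_buffer_queues(struct tile_ctx *ctx, struct app_chan *ch)
{
    struct pktbuf *p;

    /* Keep the allocation queue full: push buffers until it rejects one. */
    while ((p = pool_get(ctx)) != NULL) {
        if (spsc_push(&ch->alloc_q, p) != 0) {
            pool_put(ctx, p);               /* queue already full, put it back */
            break;
        }
    }

    /* Keep the de-allocation queue free: return buffers to their home pool. */
    while ((p = spsc_pop(&ch->dealloc_q)) != NULL)
        pool_put(ctx, p);
}
```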

Ensuring no locks inside the stack: each tile runs only one thread; each TCP connection is handled by one tile only; only single-sender and/or single-receiver queues (allocation/deallocation queues, data communication channel queues, control queues).

Ensuring flow affinity. Ingress: mPIPE calculates a hash from the 4-tuple (source and destination IP and port) of each incoming packet; the hash result is taken modulo the number of RX/TX tiles; the obtained number identifies the tile the packet is handed over to for processing. Egress: the same scenario while establishing a connection; after that, the number identifying the correct tile is held within the connection handle.
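A sketch of the flow-to-tile mapping: hash the 4-tuple and take the result modulo the number of RX/TX tiles. The hash below is only a stand-in; the real classification is done by the mPIPE hardware.

```c
#include <stdint.h>

/* Illustrative 4-tuple hash; the same tuple always maps to the same tile. */
static unsigned flow_to_tile(uint32_t src_ip, uint32_t dst_ip,
                             uint16_t src_port, uint16_t dst_port,
                             unsigned num_rxtx_tiles)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16 | dst_port);

    /* simple mixing step; the hardware classifier uses its own hash */
    h ^= h >> 16;
    h *= 0x7feb352d;
    h ^= h >> 15;

    return h % num_rxtx_tiles;
}
```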

Data flow example (diagram): ingress steps labelled 1, 2, ...; egress steps labelled A, B, ...

Test setup: modified ab tool for connection establishments and finishes per second; Infinicore hardware TCP/IP tester; iperf and a simple echo app for throughput; a counter inside the stack for latency measurements.

Throughput (1 tile)

Throughput (1 tile)

Throughput (8 tiles)

Throughput (8 tiles)

Throughput scaling

Latency

Connection performance: how many connections we can successfully establish and tear down in a second; the most difficult goal to achieve; about 500k/s with 16 cores; reached the limit of the test environment.

FreeBSD stack flexibility: fairly good overall; the majority of the TCP processing is easily reusable (TCP FSM, TCP CC, TCP syncache, TCP timers, TCP timewait, etc.); optimized for SMP with fine-grained locking.

Acknowledgements. People involved in the project (all from Semihalf): Maciej Czekaj, Rafał Jaworowski, Tomasz Nowicki, Pablo Ribalta Lorenzo, Piotr Zięcik. Special thanks (all from Tilera): Tom DeCanio, Jici Gao, Kalyan Subramanian, Satheesh Velmurugan.

Any questions?