Highly parallel, lock-less, user-space TCP/IP networking stack based on FreeBSD
EuroBSDCon 2013, Malta
Networking stack
Requirements:
- High throughput
- Low latency
- Connection establishments and teardowns per second
Solution:
- Zero-copy operation
- Lock elimination
Hardware platform overview
- Tilera TILEncore-Gx card
- TILE-Gx36 processor, 36 tiles (cores)
- 4 SFP+ 10 Gigabit Ethernet ports
Multicore architecture overview
- Local L1 and L2 caches
- Distributed L3 cache, built by using the L2 caches of other tiles
- Memory homing: local homing, remote homing, hash-for-home
mPIPE (multicore Programmable Intelligent Packet Engine)
- Packet header parsing
- Packet distribution
- Packet buffer management
- Load balancing
- Calculating the L4 checksum on ingress and egress traffic
- Gathering packet data potentially scattered across multiple buffers from tiles
Software platform overview
- Zero Overhead Linux (ZOL): one thread assigned to one tile; no interrupts, context switches or syscalls
- Modified libpthread: standard API, no significant code changes necessary except explicit affinity settings
- Library for direct access to mPIPE buffers from user space, including buffer allocation routines
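The explicit affinity settings mentioned above can be illustrated with the standard Linux threading API; this is only a minimal sketch (the tile count and worker body are assumptions), not the ZOL or modified-libpthread code itself:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical per-tile worker; the real stack runs its single-threaded
 * netstack code here in a run-to-completion loop. */
static void *tile_worker(void *arg)
{
    long tile = (long)arg;
    printf("worker running on tile %ld\n", tile);
    return NULL;
}

int main(void)
{
    enum { NUM_TILES = 4 };              /* assumption: 4 worker tiles */
    pthread_t thr[NUM_TILES];

    for (long i = 0; i < NUM_TILES; i++) {
        pthread_create(&thr[i], NULL, tile_worker, (void *)i);

        /* Pin thread i to CPU (tile) i so it never migrates. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);
        pthread_setaffinity_np(thr[i], sizeof(set), &set);
    }
    for (long i = 0; i < NUM_TILES; i++)
        pthread_join(thr[i], NULL);
    return 0;
}
```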
Design approach
- One process composed of a number of threads running on separate tiles
- Each tile executes the same single-threaded netstack code
- Each tile performs the entire processing of incoming and outgoing packets for a given TCP/IP connection (in a run-to-completion fashion)
- A given TCP/IP connection is always processed by the same tile (static flow affinity), using the mPIPE flow hash functionality
This approach has the following advantages:
- No data structure locking inside the stack
- Smaller sets of PCBs local to each tile speed up lookups, creation, etc., as they are processed in parallel on different tiles
- Optimal use of the tiles' caches
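To illustrate why per-tile PCB sets speed things up, here is a hedged sketch of a per-tile connection table; the structure, size and names are assumptions for illustration, not the stack's actual data structures. Because each table is owned by exactly one tile, lookups and insertions need no locking:

```c
#include <stdint.h>

#define PCB_BUCKETS 1024                     /* assumed per-tile table size */

/* Minimal stand-in for a protocol control block. */
struct pcb {
    struct pcb *next;                        /* hash-bucket chain           */
    uint32_t    src_ip, dst_ip;
    uint16_t    src_port, dst_port;
    /* ... TCP state would follow here ... */
};

/* Each RX/TX tile owns one of these; it is never shared with other tiles,
 * so lookups, insertions and removals run lock-free and in parallel. */
struct pcb_table {
    struct pcb *bucket[PCB_BUCKETS];
};

static unsigned pcb_bucket(uint32_t sip, uint32_t dip, uint16_t sp, uint16_t dp)
{
    return (sip ^ dip ^ (((uint32_t)sp << 16) | dp)) & (PCB_BUCKETS - 1);
}

static struct pcb *pcb_lookup(struct pcb_table *t, uint32_t sip, uint32_t dip,
                              uint16_t sp, uint16_t dp)
{
    struct pcb *p = t->bucket[pcb_bucket(sip, dip, sp, dp)];
    for (; p != NULL; p = p->next)
        if (p->src_ip == sip && p->dst_ip == dip &&
            p->src_port == sp && p->dst_port == dp)
            return p;
    return NULL;
}
```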
Functional partitioning (diagram)
- RX/TX tile: TCP processing, buffer alloc/dealloc, packet queue polling
- Channels between tiles: control channel, data channel, (de)alloc channel
- APP tile: raw/native API calls, application processing
RX/TX tile
- Performs TCP processing (FreeBSD routines)
- Polls for control messages
- Polls for data packets (both ingress from mPIPE and egress from the application)
- Manages the allocation/de-allocation queues
- Local timer tick
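A minimal sketch of how these duties can fit into a single run-to-completion loop, assuming hypothetical helper names (poll_control_queue, mpipe_poll_ingress, and so on) that stand in for the real stack's internals and the mPIPE access library:

```c
#include <stdbool.h>

/* Hypothetical handles and helpers standing in for the real stack's
 * per-tile state, mPIPE access library and FreeBSD-derived TCP routines. */
struct tile_ctx;                                 /* per-tile netstack state  */
struct pkt;                                      /* packet buffer descriptor */

extern struct pkt *mpipe_poll_ingress(struct tile_ctx *);   /* from hardware */
extern struct pkt *app_poll_egress(struct tile_ctx *);      /* from app tile */
extern bool poll_control_queue(struct tile_ctx *);          /* connect/listen */
extern void tcp_input_pkt(struct tile_ctx *, struct pkt *); /* FreeBSD path  */
extern void tcp_output_pkt(struct tile_ctx *, struct pkt *);
extern void refill_alloc_queue(struct tile_ctx *);
extern void drain_dealloc_queue(struct tile_ctx *);
extern bool timer_tick_due(struct tile_ctx *);
extern void tcp_timers_run(struct tile_ctx *);

/* Run-to-completion loop for one RX/TX tile: everything for the
 * connections homed on this tile happens here, with no locks. */
void rx_tx_tile_loop(struct tile_ctx *ctx)
{
    for (;;) {
        struct pkt *p;

        poll_control_queue(ctx);                 /* listen(), connect(), ... */

        while ((p = mpipe_poll_ingress(ctx)) != NULL)
            tcp_input_pkt(ctx, p);               /* full RX processing       */

        while ((p = app_poll_egress(ctx)) != NULL)
            tcp_output_pkt(ctx, p);              /* full TX processing       */

        refill_alloc_queue(ctx);                 /* keep app supplied with   */
        drain_dealloc_queue(ctx);                /* buffers, reclaim freed   */

        if (timer_tick_due(ctx))
            tcp_timers_run(ctx);                 /* local timer tick         */
    }
}
```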
Inter-tile channels
- Data communication channel: one for each TCP connection; ingress and egress queues; no locks at the stack endpoint; serves the socket-buffer functionality for the stack
- Control channel: one for each netstack tile; handles requests like connect, listen, etc.
- Packet buffer allocation/free channels: one for each netstack tile; described in greater detail on later slides
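A hedged sketch of how these channels could be laid out per connection and per tile; all type and field names are placeholders (the single-producer/single-consumer queue itself is sketched after the "Ensuring no locks inside the stack" slide):

```c
struct spsc_ring;                    /* lock-less SPSC queue (sketched later) */

/* Per-TCP-connection data communication channel: one ingress and one
 * egress queue, each with a single producer and a single consumer,
 * serving the socket-buffer role between app tile and RX/TX tile. */
struct data_channel {
    struct spsc_ring *ingress;       /* RX/TX tile -> app tile (received data) */
    struct spsc_ring *egress;        /* app tile -> RX/TX tile (data to send)  */
};

/* Per-netstack-tile channels shared by all connections homed on that tile. */
struct tile_channels {
    struct spsc_ring *control;       /* connect/listen/... requests            */
    struct spsc_ring *alloc_q;       /* RX/TX tile -> app tile: fresh buffers   */
    struct spsc_ring *dealloc_q;     /* app tile -> RX/TX tile: freed buffers   */
};
```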
App Lle (raw/nalve API) Similar to socket calls listen(), bind(), connect(), send(), receive(), etc. All calls always non- blocking Based on polling Provides addilonal roulnes for buffer manipulalons Necessary for zero- copy approach Includes buffer allocalon, expanding, shrinking, etc.
Socket-like API
- Inspired by lwIP
- Implemented on top of the raw/native API only
- API compatibility only: the handle returned when creating a socket is not a regular descriptor
- Intended as a temporary, easier-to-use API for the user
- Lower performance than the raw/native API
Ensuring zero copy
Requirement:
- The same packet buffer is seen by the hardware, the networking stack and the application
Solution:
- Dedicated memory pages accessible directly by mPIPE
- Buffer pools for each RX/TX tile, which eliminates locks on the stack side
- Each buffer has to return to its original pool
- Allocation/deallocation can be done only by mPIPE or the RX/TX tile
pktbuf (aka mbuf)
- Each packet is represented as a pktbuf (mbuf)
- Fixed-size buffer pools managed by the hardware
- API routines to manipulate pktbufs of arbitrary size, consisting of chains of fixed-size buffers
- Two unidirectional queues: an allocation queue from the RX/TX tile to the application tile, and a de-allocation queue from the application tile to the RX/TX tile
- RX/TX tile role: keeping the allocation queue full and the de-allocation queue drained
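For illustration, a pktbuf chain of fixed-size hardware buffers might look roughly like this; the segment size, field names and pool identifier are assumptions, not the actual mPIPE buffer layout:

```c
#include <stdint.h>

/* Assumed fixed buffer size; the real pools are sized and owned by mPIPE. */
#define PKTBUF_SEG_SIZE 1664

/* One fixed-size segment handed out by the hardware buffer pools. */
struct pktbuf_seg {
    struct pktbuf_seg *next;        /* next segment in the chain, or NULL   */
    uint16_t           len;         /* bytes of packet data in this segment */
    uint16_t           pool_id;     /* pool (RX/TX tile) the segment must
                                       eventually be returned to            */
    uint8_t            data[PKTBUF_SEG_SIZE];
};

/* A pktbuf of arbitrary size is simply a chain of segments. */
struct pktbuf {
    struct pktbuf_seg *head;
    uint32_t           total_len;   /* sum of seg->len over the chain       */
};

/* Total payload length of a pktbuf, walking the chain. */
static inline uint32_t pktbuf_len(const struct pktbuf *p)
{
    uint32_t len = 0;
    for (const struct pktbuf_seg *s = p->head; s != NULL; s = s->next)
        len += s->len;
    return len;
}
```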
Allocation/de-allocation flow (diagram): the RX/TX tile puts new or reused buffers on the allocation queue, from which the app tile requests packet buffers; the app tile frees packet buffers by returning them through the de-allocation queue.
Ensuring no locks inside the stack
- Each tile runs only one thread
- Each TCP connection is handled by one tile only
- Only single-sender and/or single-receiver queues: allocation/deallocation queues, data communication channel queues, control queues
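Such single-sender/single-receiver queues need no locks because only one tile ever advances the tail and only one tile ever advances the head. The following is a generic C11 sketch of that idea, not the stack's real queue code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 256                       /* must be a power of two */

/* Lock-less single-producer/single-consumer ring: one tile enqueues,
 * exactly one other tile dequeues, so no locks or CAS loops are needed. */
struct spsc_ring {
    _Atomic size_t head;                     /* consumer index */
    _Atomic size_t tail;                     /* producer index */
    void          *slot[RING_SLOTS];
};

/* Producer side (e.g. app tile posting a buffer to the RX/TX tile). */
static bool spsc_enqueue(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SLOTS)
        return false;                        /* full */

    r->slot[tail & (RING_SLOTS - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (e.g. RX/TX tile polling the queue in its main loop). */
static void *spsc_dequeue(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return NULL;                         /* empty */

    void *item = r->slot[head & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}
```

Because each index only ever moves forward and is written by a single tile, acquire/release ordering is sufficient; no atomic read-modify-write operations are required.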
Ensuring flow affinity
Ingress:
- mPIPE calculates a hash from the 4-tuple (source and destination IP and port) of each incoming packet
- The hash result is taken modulo the number of RX/TX tiles
- The obtained number identifies the tile the packet is handed over to for processing
Egress:
- The same scheme is used while establishing a connection
- After that, the number identifying the correct tile is held within the connection handle
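Conceptually, the mapping works as sketched below; note that mPIPE computes the flow hash in hardware, so the hash function here is only an illustrative stand-in:

```c
#include <stdint.h>

/* Illustrative only: mPIPE computes its own flow hash in hardware; this
 * stand-in hash just shows how a 4-tuple maps to an RX/TX tile index. */
struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

static uint32_t flow_hash(const struct flow_tuple *t)
{
    /* Simple mixing of the 4-tuple; the real hardware hash differs. */
    uint32_t h = t->src_ip ^ (t->dst_ip * 2654435761u);
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    h ^= h >> 15;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

/* Static flow affinity: the same connection always lands on the same tile. */
static unsigned flow_to_tile(const struct flow_tuple *t, unsigned num_rxtx_tiles)
{
    return flow_hash(t) % num_rxtx_tiles;
}
```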
Data flow example (diagram): steps 1, 2, ... show the ingress path, steps A, B, ... the egress path.
Test setup
- Modified ab tool for connection establishments and teardowns per second
- Infinicore hardware TCP/IP tester
- iperf and a simple echo app for throughput
- Counters inside the stack for latency measurements
Throughput (1 tile)
Throughput (8 tiles)
Throughput scaling
Latency
Connection performance
- How many connections we can successfully establish and tear down per second
- The most difficult metric to achieve
- About 500k/s with 16 cores
- Reached the limit of the test environment
FreeBSD stack flexibility
- Fairly good overall
- The majority of the TCP processing is easily reusable: TCP FSM, TCP congestion control, TCP syncache, TCP timers, TCP timewait, etc.
- Optimized for SMP with fine-grained locking
Acknowledgements
People involved in the project (all from Semihalf):
- Maciej Czekaj
- Rafał Jaworowski
- Tomasz Nowicki
- Pablo Ribalta Lorenzo
- Piotr Zięcik
Special thanks (all from Tilera):
- Tom DeCanio
- Jici Gao
- Kalyan Subramanian
- Satheesh Velmurugan
Any questions?