Scaling Networking Applications to Multiple Cores
Greg Seibert, Sr. Technical Marketing Engineer, Cavium Networks
Challenges with Multi-Core Application Performance
- Amdahl's Law
  - Evaluates application performance from the perspective of running time
  - Overall application performance scaling is limited by the proportion of processing that can be done in parallel
  - The scaling limitation is intrinsically related to the type of processing being done
- Evaluating system performance of networking applications
  - How much data can it pass? How many packets per second?
- Scaling via parallelization
  - Networking applications provide a convenient quantum of work: the packet
  - Flows are mostly independent
  - Critical regions: per-flow data structures
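Amdahl's Law can be made concrete with a small calculation: speedup = 1 / ((1 - p) + p/n), where p is the parallelizable fraction of running time and n is the core count. The sketch below is generic; the function name is illustrative, not from the slides.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: overall speedup on `cores` CPUs when only
    `parallel_fraction` of the running time can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Even a small serial fraction caps scaling: with 5% serial work,
# 16 cores deliver barely 9x, and no core count can exceed 20x.
sixteen_core = amdahl_speedup(0.95, 16)
```

Note that with `parallel_fraction = 0.95` the speedup can never exceed 1/0.05 = 20, no matter how many cores are added; this is why minimizing the serial portion (the critical regions) matters more than adding cores.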
Multi-Core Programming Techniques
- Independent processes on each core
  - Each process can maintain state in local storage and avoid shared-memory contention
  - Processes are tightly coupled via in-memory IPC mechanisms
- Pipelined
  - Divide the application into stages
  - Each stage can be limited to fit completely into the instruction cache
  - Application performance is limited by the throughput of the slowest stage
  - The entire application requires an a priori division of operations
- Symmetric Multi-Processing (SMP)
  - Same program/image running on multiple cores
  - All instances are identical and can load-balance organically
  - Classic implementations are difficult to scale
Independent Processes on Each Core
- Communication between cores requires inter-processor communication (IPC) mechanisms
  - Shared memory
  - Inter-CPU interrupts
  - Message queues
- Familiar implementation
  - A multi-programming OS enables this paradigm on single- or multi-CPU systems
- Processing overhead from the IPC mechanism can be significant
  - Context switching and messaging consume CPU cycles that do not contribute to implementing the application's features
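The message-queue IPC mentioned above can be sketched with Python's `multiprocessing` module as a stand-in for the real mechanisms; all names here (`producer`, `run_ipc_demo`) are illustrative, and the POSIX `fork` start method is assumed.

```python
from multiprocessing import get_context

def producer(q, n):
    """Child process: send n messages, then a sentinel to end the stream."""
    for i in range(n):
        q.put(i)
    q.put(None)

def run_ipc_demo(n=3):
    """Parent process: receive messages over the queue until the sentinel."""
    ctx = get_context("fork")            # assumes POSIX fork is available
    q = ctx.Queue()                      # the in-memory IPC channel
    child = ctx.Process(target=producer, args=(q, n))
    child.start()
    received = []
    while (msg := q.get()) is not None:  # each get() crosses the IPC boundary
        received.append(msg)
    child.join()
    return received
```

Every `put`/`get` pair here crosses a process boundary; that crossing is exactly the per-message overhead the slide warns about, which is why high-rate datapaths batch messages or avoid IPC on the fast path entirely.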
Dividing Applications into Pipeline Stages
- Parallelism can be implemented by having the first stage identify the traffic and queue it to multiple second-stage instances
  - Each instance of the second stage can be assigned all the packets of a flow
  - Balancing flows between second-stage instances requires careful work in the first stage
- Each stage's code size can be limited to fit into the L1 instruction cache
  - The performance impact of instruction-cache misses can be reduced
- Static assignment of operations can lead to variance in dynamic system performance
  - Dynamic allocation of operations, or of the number of stage instances, can somewhat mitigate this effect
  - Can require complex software
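Assigning all packets of a flow to one second-stage instance is commonly done by hashing the flow's 5-tuple; a minimal sketch (function and variable names are ours, and CRC32 stands in for whatever hash the first stage actually uses):

```python
import zlib

def stage_instance_for(flow_tuple, num_instances):
    """Pin every packet of a flow to one second-stage instance by
    hashing its 5-tuple (src IP, dst IP, proto, src port, dst port).
    Same flow -> same hash -> same instance, so per-flow state never
    needs to be shared between instances."""
    key = "|".join(str(f) for f in flow_tuple).encode()
    return zlib.crc32(key) % num_instances

flow = ("10.0.0.1", "10.0.0.2", "tcp", 12345, 80)
instance = stage_instance_for(flow, num_instances=4)
```

The hard part the slide alludes to is that hashing balances flows, not load: one elephant flow still lands on a single instance, which is why dynamic rebalancing schemes exist.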
SMP - All Cores Able to Do All Things
- Different traffic profiles require a different balance of processing
  - With all application instances able to perform all processing, a dynamic balance occurs organically
- A single code set (image) can be developed, integrating multiple independently designed and unit-tested modules
  - Testing can verify that each modular component meets performance expectations and interface requirements
  - System testing and verification only needs to ensure that a single image is put through its paces
- Critical regions must be minimized
  - Mutual-exclusion mechanisms (mutexes) protecting these regions can reduce overall application scaling
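One standard way to minimize mutex impact in an SMP design is lock striping: one lock per hash bucket of flows rather than one global lock, so packets from different flows rarely contend. A minimal sketch, with all names (`flow_locks`, `update_flow`, `demo`) ours:

```python
import threading
import zlib

NUM_LOCKS = 16
# One lock per stripe instead of one global lock: different flows
# usually hash to different stripes, so they do not contend.
flow_locks = [threading.Lock() for _ in range(NUM_LOCKS)]
flow_state = {}   # flow id -> byte count

def update_flow(flow_id, nbytes):
    lock = flow_locks[zlib.crc32(flow_id.encode()) % NUM_LOCKS]
    with lock:    # critical region covers only this flow's stripe
        flow_state[flow_id] = flow_state.get(flow_id, 0) + nbytes

def demo(threads=4, updates=250):
    def worker():
        for _ in range(updates):
            update_flow("flow-a", 1)
            update_flow("flow-b", 2)
    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return flow_state["flow-a"], flow_state["flow-b"]
```

The read-modify-write of a flow's counter is the critical region; striping keeps that region small and rarely contended, which is what the slide means by minimizing critical regions.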
Designing for Optimal Performance
- Goal: keep the CPUs busy executing the application's instructions
- Minimize, if not eliminate, the need to handle interrupts and context switches
  - System calls, interrupts, exceptions, and context switches take CPU cycles away from the application
- Highest performance: design a single process per CPU and use polling for I/O
- Maximize, through design, independent and parallel operations
- Keep critical regions to a minimum, if not eliminate them altogether
  - Protecting critical regions is the single largest impediment to efficient scaling
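The "single process per CPU, polling for I/O" model is a run-to-completion loop with a budget, so the core never takes an interrupt and only returns to housekeeping between bursts. A toy sketch with a plain queue standing in for the hardware receive ring (all names are ours):

```python
from collections import deque

def poll_loop(rx_queue, handler, budget):
    """Run-to-completion polling: drain up to `budget` packets with no
    interrupts or context switches, then return so the caller can do
    housekeeping (timers, stats) before polling again."""
    done = 0
    while rx_queue and done < budget:
        handler(rx_queue.popleft())   # process one packet to completion
        done += 1
    return done

seen = []
rxq = deque(f"pkt{i}" for i in range(5))
first_pass = poll_loop(rxq, seen.append, budget=3)   # handles pkt0..pkt2
second_pass = poll_loop(rxq, seen.append, budget=8)  # drains pkt3, pkt4
```

The budget is what keeps polling well behaved: without it, a flood on one queue could starve every other duty of the core.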
Which Method to Choose?
- No one method is intrinsically better than the others; each has its own application space
- Pipelines benefit:
  - Single high-bandwidth flows that require processing phases to be performed atomically
- Symmetric multiprocessing benefits:
  - Multiple flows that can be processed in parallel
  - Low-latency traffic that can be processed in parallel while preserving ingress order on egress
  - A wider range of traffic profiles can maintain performance
- Independent process/thread applications benefit:
  - Existing multi-threaded or multi-process implementations that want to gain performance without significant redesign
  - Applications that rely on operating system services
How Can Hardware Help?
- Perform triage on incoming packet traffic and assign a rough priority
  - Then hand it off to the software in a prioritized fashion
- Provide some evaluation of the packet
  - E.g., flow identification
- Maintain packet arrival order throughout processing
- Execute menial tasks such as buffer management
  - Recycling buffers that have been sent, making them available for new incoming packets
- Reduce, if not completely eliminate, the need to protect shared data structures
  - Access to shared data structures is usually per-flow
  - Hardware can ensure that only one packet per flow is being processed at a time
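The buffer-management task the hardware takes over is essentially a free-list allocator; a toy software model of it (class and method names are ours, not a hardware interface):

```python
class BufferPool:
    """Fixed pool of packet buffers managed as a free list. The hardware
    analog allocates a buffer per arriving packet and recycles it after
    transmit, all without software intervention."""

    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]

    def alloc(self):
        """Take a buffer from the free list; None means pool exhausted
        (the hardware would drop or back-pressure at this point)."""
        return self._free.pop() if self._free else None

    def free(self, buf):
        """Recycle a transmitted buffer for a new incoming packet."""
        self._free.append(buf)

    def available(self):
        return len(self._free)
```

When this bookkeeping lives in hardware, the cores never spend cycles on alloc/free, and the free list itself stops being a shared data structure that software must lock.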
Spinlocks and High-Contention Locks
- Multi-CPU synchronization requires a memory-based contention primitive
- Spinlocks are based on the MIPS-defined Load-Linked and Store-Conditional instructions
- The statistical nature of their operation is inherently unfair
- OCTEON's SSO can be used to implement fair locks
  - Locking can be done without blocking: "acquire the lock while I do something useful"
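The slides do not show the SSO lock itself, but the classic fair alternative to an unfair LL/SC test-and-set spinlock is a ticket lock, where waiters are served strictly in arrival order. A Python sketch (a `threading.Lock` stands in for the atomic ticket fetch that LL/SC would provide, and `time.sleep(0)` yields while spinning, where real code would busy-wait):

```python
import threading
import time

class TicketLock:
    """A fair spinlock: each waiter takes a ticket and spins until it is
    served, so no core can win the lock repeatedly by luck (the
    statistical unfairness of a plain test-and-set spinlock)."""

    def __init__(self):
        self._guard = threading.Lock()   # stands in for an atomic fetch-add
        self._next_ticket = 0
        self._now_serving = 0

    def acquire(self):
        with self._guard:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)                # yield; real spinlocks busy-wait

    def release(self):
        self._now_serving += 1           # only the holder writes this

counter = 0
lock = TicketLock()

def worker(n):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1                     # critical region
        lock.release()

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Fairness matters under high contention: with an unfair spinlock, the core whose cache holds the lock line tends to reacquire it, starving the others and skewing per-flow latency.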
How OCTEON Enables High Performance
- Applications running as Linux processes have direct access to hardware blocks via the Simple Executive API
  - Send and receive packets directly
- Integrated packet input (PIP) and packet output (PKO) processors, with knowledge of common network protocols, offload software from laborious header validations
  - PIP provides the results of these tests as a set of flags
  - Packets get flow classification on ingress
  - PKO computes and inserts the transport-layer checksum on egress
- Hardware buffer management
  - Processors can allocate and free buffers without software intervention
- Many operations execute in parallel with the dual-issue cores
  - Software can continue to execute instructions while time-consuming operations run to completion
  - I/O units can DMA results to a core's local memory
  - Crypto instructions execute asynchronously to the pipeline
How OCTEON Enables High Performance (continued)
- Introduces a work-flow paradigm
  - The SSO offloads software from the task of scheduling which operations execute on the cores
  - PIP works in conjunction with the SSO to prioritize ingress packets as instances of work
  - PIP classifies and tags packets, so the SSO can ensure the software on the cores can work on packets without interference
  - Polling for work alleviates the overhead of interrupt handling
  - Completion results from application-specific coprocessors are submitted as instances of work
  - Timer events can be processed as instances of work
- Software can be optimized to significantly increase the application's performance
  - Hardware work scheduling, independent of the CPUs, can eliminate the need for critical regions
  - Using atomic tags lets software operate knowing it has sole access to a resource
  - Flow-based network traffic has per-flow data structures requiring exclusive access (e.g., a state machine)
  - Hardware ensures only a single packet per flow is being worked on
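The atomic-tag guarantee can be illustrated with a toy software model of the scheduling policy: at most one work item per tag is outstanding, so a handler can touch per-flow state with no locks at all. This assumes nothing about the real SSO interface; every name below is illustrative.

```python
from collections import defaultdict, deque

class AtomicTagScheduler:
    """Toy model of atomic-tag scheduling: work items carry a tag (e.g.
    a flow ID), and the scheduler never hands out a second item for a
    tag until the first is completed."""

    def __init__(self):
        self._pending = defaultdict(deque)   # tag -> queued work items
        self._in_flight = set()              # tags currently being worked

    def submit(self, tag, item):
        self._pending[tag].append(item)

    def get_work(self):
        """Return the next (tag, item) whose tag is not already in
        flight, or None if every pending tag is busy."""
        for tag, q in self._pending.items():
            if q and tag not in self._in_flight:
                self._in_flight.add(tag)
                return tag, q.popleft()
        return None

    def complete(self, tag):
        """Core finished its item; the tag's next item becomes eligible."""
        self._in_flight.discard(tag)
```

Because the second packet of a flow is simply withheld until the first completes, the per-flow state machine needs no mutex: exclusivity is enforced by scheduling rather than by locking.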
OCTEON Does It All
- OCTEON's cnMIPS cores can operate independently
  - All cores share the same physical memory space, so shared-memory IPC is easy to implement
  - Each core has its own mailbox interrupts
- Using the SSO, OCTEON can efficiently implement a pipeline
  - Each group of cores represents a single stage in the pipeline
  - A group-switch operation passes work to the next stage
  - Data/state is passed via the work-queue entry (WQE) structure
- Using the traffic classification and tagging from the PIP, the SSO can arbitrate which packets get worked on
  - Can obviate the need to protect per-flow data structures (e.g., a TCP control block)
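The group-switch pipeline can be modeled in a few lines: each stage owns a queue, and "switching groups" just moves the work-queue entry to the next stage's queue. A toy sketch, with stage names and the dict-as-WQE purely illustrative:

```python
from collections import deque

# Each group of cores is one pipeline stage; a group switch moves the
# work-queue entry (here, a plain dict) to the next stage's queue.
stage_queues = {"parse": deque(), "process": deque(), "transmit": deque()}
NEXT_STAGE = {"parse": "process", "process": "transmit", "transmit": None}

def group_switch(wqe, current_stage):
    """Pass the WQE to the next stage's queue; None means end of pipeline."""
    nxt = NEXT_STAGE[current_stage]
    if nxt is not None:
        stage_queues[nxt].append(wqe)
    return nxt

# Walk one WQE through the whole pipeline and record the stages visited.
wqe = {"flow": 7, "payload": b"hello"}
stage_queues["parse"].append(wqe)
trace = []
stage = "parse"
while stage is not None:
    item = stage_queues[stage].popleft()
    trace.append(stage)
    stage = group_switch(item, stage)
```

Because the WQE carries all the data and state between stages, a stage's code can stay small enough to live in the L1 instruction cache, which is the point of the pipelined model described earlier.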