Scaling Networking Applications to Multiple Cores




Scaling Networking Applications to Multiple Cores
Greg Seibert
Sr. Technical Marketing Engineer
Cavium Networks

Challenges with Multi-Core Application Performance
- Amdahl's Law
  - Evaluates application performance from the perspective of running time
  - Overall application scaling is limited by the proportion of processing that can be done in parallel
  - The scaling limitation is intrinsically related to the type of processing being done
- Evaluating system performance of networking applications
  - How much data can it pass?
  - How many packets per second?
- Scaling = Parallelization
  - Networking applications provide a convenient quantum of work: the packet
  - Flows are mostly independent
  - Critical regions: per-flow data structures

Multi-Core Programming Techniques
- Independent processes on each core
  - Each process can maintain state in local storage and avoid shared-memory contention
  - Processes are coupled via in-memory IPC mechanisms
- Pipelined
  - Divide the application into stages
  - Each stage can be limited to fit completely into the instruction cache
  - Application performance is limited by the throughput of the slowest stage
  - The entire application requires an a-priori division of operations
- Symmetric Multi-Processing (SMP)
  - The same program image runs on multiple cores
  - All instances are identical and can load-balance organically
  - Classic implementations are difficult to scale

Independent Processes on Each Core
- Communication between cores requires inter-processor communication (IPC) mechanisms:
  - Shared memory
  - Inter-CPU interrupts
  - Message queues
- Familiar implementation
  - A multi-programming OS enables this paradigm on single- or multi-CPU systems
- Processing overhead from the IPC mechanism can be significant
  - Context switching and messaging consume CPU cycles that do not contribute to implementing the application's features

Dividing Applications into Pipeline Stages
- Parallelism can be implemented by having the first stage identify the traffic and queue it to multiple second-stage instances
  - Each instance of the second stage can be assigned all the packets of a flow
  - Balancing flows between second-stage instances requires some tricky footwork in the first stage
- Each stage's code size can be limited to fit into the L1 instruction cache
  - Reduces the performance impact of instruction-cache misses
- Static assignment of operations can lead to variance in dynamic system performance
  - Dynamic allocation of operations, or of the number of stage instances, can somewhat mitigate this effect
  - Can require complex software
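The first stage's flow dispatch can be sketched as a hash over the packet's 5-tuple, so every packet of a flow lands on the same second-stage instance and per-flow order is preserved. The FNV-1a hash and field layout here are illustrative choices, not any specific product's scheme:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Classic 5-tuple identifying a flow. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a: a simple, deterministic byte hash. */
static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* First-stage dispatch: all packets of one flow map to one
 * second-stage instance. Fields are copied into a flat key so
 * struct padding cannot perturb the hash. */
unsigned pick_stage(const struct five_tuple *ft, unsigned n_stages)
{
    uint8_t key[13];
    memcpy(key,      &ft->src_ip,   4);
    memcpy(key + 4,  &ft->dst_ip,   4);
    memcpy(key + 8,  &ft->src_port, 2);
    memcpy(key + 10, &ft->dst_port, 2);
    key[12] = ft->proto;
    return fnv1a(key, sizeof key) % n_stages;
}
```

The "tricky footwork" the slide mentions shows up when flows hash unevenly: a static hash can leave one second-stage instance overloaded while others idle.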

SMP - All Cores Able to Do All Things
- Different traffic profiles require a different balance of processing
  - With all application instances able to perform all processing, a dynamic balance occurs organically
- A single code set (image) can be developed, integrating independently designed and unit-tested modules
  - Testing can verify that each modular component meets its performance and interface requirements
  - System testing and verification only needs to put a single image through its paces
- Critical regions must be minimized
  - Mutual-exclusion mechanisms (mutexes) protecting these regions can limit overall application scaling
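The scaling limiter named in the last bullet is easy to demonstrate: symmetric workers that all funnel through one mutex-protected update serialize on that lock, no matter how many cores run them. A minimal pthreads sketch (names are illustrative):

```c
#include <pthread.h>
#include <stddef.h>

/* Shared per-flow state protected by a mutex — the critical region
 * the slide says must be kept minimal. */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;
static long flow_packets;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&flow_lock);   /* every worker serializes here */
        flow_packets++;                   /* keep this region short */
        pthread_mutex_unlock(&flow_lock);
    }
    return NULL;
}

/* Run n symmetric workers; correctness holds, but throughput in the
 * locked section does not improve with n. */
long run_workers(int n)
{
    pthread_t t[16];
    if (n > 16) n = 16;
    flow_packets = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return flow_packets;
}
```

Adding cores adds contention on `flow_lock` rather than throughput — which is why the later OCTEON slides push this serialization into hardware tags instead.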

Designing for Optimal Performance
- Goal: keep the CPUs busy executing the application's instructions
- Minimize, if not eliminate, interrupt handling and context switches
  - System calls, interrupts and exceptions, and context switching take CPU cycles away from the application
- Highest performance: design a single process per CPU and use polling for I/O
- Maximize, through design, independent and parallel operations
- Keep critical regions to a minimum, if not eliminate them altogether
  - Protecting critical regions is the single largest impediment to efficient scaling
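The "single process per CPU, polling for I/O" design reduces to a run-to-completion loop over a receive ring. The ring below is a plain array standing in for a NIC descriptor ring; sizes and names are illustrative, not from any particular device:

```c
#include <stddef.h>

#define RING_SIZE 8

/* Toy receive ring: head is the next slot to consume, tail the next
 * slot the "NIC" fills. No interrupts anywhere — the core discovers
 * new packets purely by polling. */
struct ring {
    int pkts[RING_SIZE];
    unsigned head, tail;
};

static void ring_push(struct ring *r, int pkt)
{
    r->pkts[r->tail % RING_SIZE] = pkt;
    r->tail++;
}

/* Poll once: returns 1 and a packet if one is ready, 0 otherwise. */
static int ring_poll(struct ring *r, int *pkt)
{
    if (r->head == r->tail)
        return 0;                        /* nothing ready; caller loops */
    *pkt = r->pkts[r->head % RING_SIZE];
    r->head++;
    return 1;
}

/* Shape of the main loop: while (running) { if (poll) process; }
 * Here "processing" just sums packet values to completion. */
int drain(struct ring *r)
{
    int pkt, sum = 0;
    while (ring_poll(r, &pkt))
        sum += pkt;
    return sum;
}
```

No system calls, no interrupts, no context switches in the hot path — every cycle goes to the application, at the cost of a core that spins even when idle.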

Which Method to Choose?
- No one method is intrinsically better than the others; each has its own application space
- Pipelines benefit:
  - Single high-bandwidth flows that require processing phases to be done atomically
- Symmetric multiprocessing benefits:
  - Multiple flows that can be processed in parallel
  - Low-latency traffic that can be processed in parallel while preserving ingress order on egress
  - A wider range of traffic profiles can maintain performance
- Independent process/thread applications benefit:
  - Existing multi-threaded or multi-process implementations wishing to gain performance without significant redesign
  - Applications that rely on operating-system services

How Can Hardware Help?
- Perform triage on incoming packet traffic, assign a rough priority, and hand packets off to software in prioritized order
- Provide some evaluation of the packet, e.g. flow identification
- Maintain packet arrival order throughout processing
- Execute menial tasks such as buffer management
  - Recycle buffers that have been transmitted, making them available for new incoming packets
- Reduce, if not completely eliminate, the need to protect shared data structures
  - Access to shared data structures is usually per-flow
  - Hardware can ensure only one packet per flow is being worked on at a time

Spinlocks and High-Contention Locks
- Multi-CPU synchronization requires a memory-based contention primitive
- Spinlocks are based on the MIPS-defined Load-Linked and Store-Conditional instructions
- The statistical nature of their operation is inherently unfair
- OCTEON's SSO can be used to implement fair locks
  - Locking can be non-blocking: "acquire the lock while I do something useful"
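A minimal spinlock sketch using portable C11 atomics — on MIPS, the atomic exchange below compiles down to the Load-Linked/Store-Conditional pair the slide names, and which spinning core wins next is statistical, hence the unfairness noted above:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Test-and-set spinlock on an atomic flag. */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static long shared_count;

static void spin_lock(void)
{
    /* Spin until the flag was previously clear (we won the race). */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        spin_lock();
        shared_count++;
        spin_unlock();
    }
    return NULL;
}

/* n contending threads; the final count proves mutual exclusion held,
 * but says nothing about which thread acquired the lock when. */
long run_spinners(int n)
{
    pthread_t t[8];
    if (n > 8) n = 8;
    shared_count = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return shared_count;
}
```

Under high contention, a core can lose the store-conditional race indefinitely while neighbors keep winning — the fairness gap that the slide proposes the SSO can close.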

How OCTEON Enables High Performance
- Applications running as Linux processes have direct access to hardware blocks via the Simple Exec API
  - Send and receive packets directly
- Integrated packet input (PIP) and output (PKO) processors with knowledge of common network protocols offload software from laborious header validations
  - PIP provides the results of these tests as a set of flags
  - Packets get flow classification on ingress
  - PKO computes and inserts the transport-layer checksum on egress
- Hardware buffer management
  - Processors can allocate and free buffers without software intervention
- Many operations execute in parallel with the dual-issue cores
  - Software can continue to execute instructions while time-consuming operations run to completion
  - I/O units can DMA results into a core's local memory
  - Crypto instructions execute asynchronously to the pipeline

How OCTEON Enables High Performance (continued)
- Introduces a work-flow paradigm
  - The SSO offloads software from the task of scheduling which operations get executed on which cores
  - PIP works in conjunction with the SSO to prioritize ingress packets as instances of work
  - PIP classifies and tags packets, so the SSO can ensure software on the cores can work on packets without interference
  - Polling for work alleviates the overhead of interrupt handling
  - Completion results from application-specific coprocessors are submitted as instances of work
  - Timer events can be processed as instances of work
- Software can be optimized to significantly increase application performance
  - Hardware work scheduling, independent of the CPUs, can eliminate the need for critical regions
  - Using atomic tags allows software to operate knowing it has sole access to resources
  - Flow-based network traffic has per-flow data structures requiring exclusive access, e.g. a state machine
  - Hardware ensures only a single packet per flow is being worked on
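The atomic-tag rule can be modeled in a few lines: at most one in-flight work item per tag value, so software touching per-flow state needs no lock. This is a toy model of the scheduling invariant only — the function names and structure are hypothetical and bear no relation to the actual Cavium Simple Exec API:

```c
#include <stdbool.h>

#define MAX_TAGS 16

/* Toy SSO model: tracks, per tag, whether some core currently holds
 * work with that tag. */
struct sso_model {
    bool in_flight[MAX_TAGS];
};

/* A core requests work carrying the given (atomic) tag. Granted only
 * if no other core holds that tag, so per-flow state is exclusive. */
bool sso_get_work(struct sso_model *s, unsigned tag)
{
    if (s->in_flight[tag % MAX_TAGS])
        return false;            /* same flow busy elsewhere: not scheduled */
    s->in_flight[tag % MAX_TAGS] = true;
    return true;
}

/* Core finished with the work (or switched tags): the flow becomes
 * schedulable again. */
void sso_release(struct sso_model *s, unsigned tag)
{
    s->in_flight[tag % MAX_TAGS] = false;
}
```

Under this invariant, the software that handles a flow's packet can update the flow's state machine with no mutex at all: exclusivity is guaranteed by the scheduler rather than by a lock in the data path.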

OCTEON Does It All
- OCTEON's cnMIPS cores can operate independently
  - All cores share the same physical memory space, so shared-memory IPC is easy to implement
  - Each core has its own mailbox interrupts
- Using the SSO, OCTEON can efficiently implement a pipeline
  - Each group of cores represents a single stage in the pipeline
  - A group-switch operation passes work to the next stage
  - Data/state is passed via the Work Queue Entry structure
- Using the traffic classification and tagging from the PIP, the SSO can arbitrate which packets get worked on
  - Can obviate the need to protect per-flow data structures (e.g. the TCP control block)