FlexPath Network Processor
Rainer Ohlendorf, Thomas Wild, Prof. Dr. Andreas Herkersdorf
Technische Universität München, Arcisstraße 21, 80290 München
http://www.lis.ei.tum.de
Agenda
- FlexPath Introduction
- Work Packages (2nd Phase)
  - Multi-Processor Extensions
  - Load Balancing Techniques in FlexPath: Dedicated Assignment, Packet Spraying
  - FlexPath Implementation / Demonstrator
- Project Highlights and Outlook
Basic Ideas of FlexPath NP
- flexible packet processing paths on-chip
- hardware decision on the packet path, depending on networking application / protocol
- per-packet analysis in real time
- run-time reconfigurable rule base: supports varying traffic patterns over time
- hardware offload relieves the programmable resources
(Block diagram: Ingress Hardware Processing Pipeline → AutoRoute path or CPU path → Egress Hardware Processing Pipeline)
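The path decision described above can be illustrated with a minimal sketch: header fields are matched against a reconfigurable rule base, and the first matching rule selects the on-chip path. The rule encoding, field names, and example rules below are illustrative assumptions, not the actual FlexPath rule format.

```python
# Hypothetical sketch of a Path Dispatcher rule base.
# Rules are evaluated in order; the first match decides the path.
RULES = [
    # known, stateless stream -> bypass the CPUs entirely
    (lambda p: p["proto"] == "UDP" and p["dport"] == 5004, "AutoRoute"),
    # IPsec ESP needs stateful software / crypto-core processing
    (lambda p: p["proto"] == "ESP", "CPU"),
]
DEFAULT_PATH = "CPU"  # unknown packets go to a programmable core

def dispatch(pkt):
    """Return the processing path for one packet (per-packet decision)."""
    for predicate, path in RULES:
        if predicate(pkt):
            return path
    return DEFAULT_PATH

print(dispatch({"proto": "UDP", "dport": 5004}))  # AutoRoute
print(dispatch({"proto": "TCP", "dport": 80}))    # CPU
```

Because the rule list is plain data, it can be swapped at run time, which mirrors the run-time reconfigurability of the hardware rule base.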
Work Packages of the Second Phase
MPSoC enhancements:
- how to notify processing elements (CPUs, HW accelerators)?
- multi-processor interrupt controller / packet distributor
- software / drivers
Load balancing techniques:
- how to achieve optimal resource utilization?
- evaluation of different balancing concepts; mapping onto FlexPath NP
FPGA demonstrator:
- fully working network processor on an FPGA demonstrator platform
- demonstration tomorrow!
Multi-Processor Extensions
Interrupt controller (fully multi-processor ready):
- based on the Xilinx single-processor IntC
- extended by several configuration registers (e.g. to configure the queue-to-CPU assignment)
- atomic register read-out (instead of read/modify/write-back)
Packet Distributor:
- priority-based
- supports Dedicated Assignment and Spraying
(Block diagram: the Path Dispatcher feeds the Packet Distributor, which notifies CPU 0-3 according to their IDLE/BUSY state)
Load Balancing Techniques
Load balancing is not a new problem: cluster load balancing is well known from HPC, but also for NPUs.
What is in the NP literature?
- simple hashing & packet spraying for "aggressive flows" (Dittmann, 2002)
- adaptive HRW hashing (Kencl, 2003)
- adaptive burst shifting (Shi, 2005)
- HABS (Kencl/Shi, 2006)
All known techniques are applied uniformly to all packets: homogeneous balancing, no QoS consideration.
Important side effect: flow reassignments may lead to packet reordering, and packet reordering significantly reduces the efficiency of TCP (>90% of all network packets).
Load Balancing - Networking Applications
IP router:
- 90% of traffic uses TCP; 50% of TCP packets are ACKs; <10% of packets are DSCP-marked
- plain routing is "stateless": each packet is independent of other packets
- QoS behavior (DSCP): preferential treatment; hot candidate for AutoRoute
Security / crypto gateways:
- either require high processing effort per packet (N × 1000 instructions/packet) or invocation of dedicated HW accelerators / coprocessors
- IPsec is a stateful protocol: connection parameters (keys, etc.), sequence numbers; processing on the same instance is optimal
- may require HW/SW partitioning and use of crypto cores
(Figure: IPsec tunnel between two VPN gateways connecting Corporate NW 1 and Corporate NW 2 over the Internet)
Load Balancing - Stateless Networking Applications
- no shared connection state is maintained: any processor can work on any packet
- dedicated assignment of flows to processors may be inefficient: "bursty" traffic patterns lead to temporary overloads, and high balancing overhead is needed to compensate them
- packet spraying is a "self-organizing" solution: packets are assigned to a single queue (difference to Dittmann), and every idle CPU fetches packets from this queue
- QoS may be implemented by providing additional "high priority" queues
(Figure: Dittmann (2002) uses a scheduler in front of per-CPU queues; FlexPath (2009) uses the Path Dispatcher with shared queues and Path Control)
The packet reordering problem is addressed in FlexPath by Path Control.
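The spraying scheme above can be sketched in a few lines: all packets land in shared queues, and an idle CPU simply pulls from the highest-priority non-empty queue. This is a simplified illustration (the queue names and the two-level priority scheme are assumptions), not the FlexPath hardware implementation.

```python
from collections import deque

# Shared queues: one "high" queue for QoS traffic, one for best effort.
queues = {"high": deque(), "best_effort": deque()}

def enqueue(pkt, prio="best_effort"):
    """Spraying: every packet of a class goes into one shared queue."""
    queues[prio].append(pkt)

def fetch_next():
    """An idle CPU fetches from the highest-priority non-empty queue."""
    for prio in ("high", "best_effort"):
        if queues[prio]:
            return queues[prio].popleft()
    return None  # nothing to do, CPU stays idle

enqueue("pkt_be")
enqueue("pkt_qos", prio="high")
print(fetch_next())  # pkt_qos
print(fetch_next())  # pkt_be
```

Because no per-CPU queues exist, load balances itself: a busy CPU simply fetches nothing, and the next idle CPU takes the packet instead.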
Load Balancing - Stateful Networking Applications
- hashing-based assignment of flows to processors is well suited for stateful processing
- without run-time adaptation: problem of biased hash bundle sizes (Dittmann)
- run-time adaptation (HRW, Kencl) is relatively complex to implement
- HLU: simple, adaptive flow assignment; hashing-based, high flow-mapping persistence, low implementation complexity
(Figure: example of flows 1-7 being redistributed between CPU1 and CPU2 as the load split shifts, e.g. 65%/35% → 40%/60% → 45%/55%)
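The core property of hashing-based assignment is persistence: the same flow always maps to the same processor, so connection state stays local and packets are not reordered. A minimal generic sketch (this shows plain static hashing, not the HLU adaptation step; the 5-tuple field names and CRC32 hash are assumptions):

```python
import zlib

N_CPUS = 2

def flow_key(pkt):
    # the classic 5-tuple identifies a flow
    return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])

def assign(pkt):
    """Hash the flow identity onto one of N_CPUS processors."""
    h = zlib.crc32(repr(flow_key(pkt)).encode())
    return h % N_CPUS

pkt = {"src": "10.0.0.1", "dst": "10.0.0.2",
       "sport": 1234, "dport": 4500, "proto": "ESP"}
# every packet of the flow lands on the same CPU -> state stays local
assert assign(pkt) == assign(dict(pkt))
```

The known weakness of this static scheme is exactly the biased-bundle-size problem cited above: some hash buckets carry far more traffic than others, which is what adaptive schemes such as HRW or HLU correct at run time.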
Load Balancing - System Simulation
(Plot: packet loss rate, log scale, vs. number of PEs in the processor cluster, comparing HABS, HLU, spraying and AHH)
- FlexPath assignment for stateful traffic is slightly better than Kencl's adaptive scheme
- FlexPath assignment for stateless traffic achieves lossless operation with the fewest CPUs
- all dedicated assignment schemes reach a minimum packet loss floor
Load Balancing - System Simulation
(Plot: average packet latency in µs vs. number of PEs in the processor cluster, for HABS and Spray&Hash with QoS, IPsec and best-effort traffic classes)
- FlexPath spraying with 2 priorities provides a QoS guarantee
- separating IPsec and forwarding traffic in FlexPath yields better forwarding latency
FlexPath Demonstrator - Floorplan
Xilinx ML410 (Virtex-4 FX60):
- Slices: 18,549 (73%)
- BRAMs: 119 (51%)
- PPCs: 2 (100%)
- EMACs: 2 (100%)
Critical path: 9.965 ns => f_max = 100.351 MHz
FPGA Prototype - Measurement Results (1)
(Plots: CPU load [%] and packet rates [kpps] over the injected IPsec traffic share [kbit/s], without and with the Path Dispatcher)
- ~70% CPU load with IP forwarding on a single CPU core (latency: ~20 µs)
- a small IPsec data rate dramatically increases the CPU load (latency: >1 ms)
- IP forwarding performance declines before the CPU gets saturated
- classification in the Path Dispatcher reduces the CPU load to 37% (latency: ~10 µs)
- IPsec packets are lost when more than 2.7 Mbit/s are injected
FPGA Prototype - Measurement Results (2)
(Plots: CPU load of both CPUs [%] and packet rates [kpps] over the injected IPsec traffic share [kbit/s], for the two-CPU and AutoRoute configurations)
- forwarding traffic is either processed exclusively by the second CPU or sprayed among both CPUs
- no more forwarding traffic is lost
- instead of using a second CPU, forwarding traffic may also be AutoRouted: the throughput chart is identical to the 2-CPU case
- AutoRoute latency: ~0.5 µs vs. ~10 µs in software
Demonstrator - NPU HW Enablement Overheads
- Parsing / Context Builder: 2,660 slices, 5 BRAMs
- Classification / Balancing: 1,446 slices, 14 BRAMs
- Manipulation: 1,615 slices, 3 BRAMs
- MicroBlaze CPU: 1,533 slices, 4 BRAMs
HW path forwarding performance: up to 3.2 Gbit/s (32 bit @ 100 MHz), independent of packet size
CPU path forwarding performance: ~95 kpps, 366 Mbit/s (assuming an average packet size of 481 byte, IMIX)
Project Highlights
New architectural approach for network processors:
- hardware offload possibilities
- run-time reconfigurable processing path assignment
- the system adapts itself to various application requirements
Application-aware load balancing strategy:
- concept evaluation by system-level simulations (SystemC)
- prototype implementation on FPGA (proof of concept)
Publications in the last phase: 1 international workshop paper, 8 international conference papers, 2 journal papers, including:
- "A Packet Classification Technique for On-Chip Processing Path Decision", WASP 2007, Salzburg
- "A HW Packet Resequencer Unit for Network Processors", ARCS 2008, Dresden
- "FlexPath NP - A Network Processor Architecture with Flexible Processing Paths", SoC 2008, Tampere
- "A Processing Path Dispatcher in Network Processor MPSoCs", IEEE Transactions on VLSI, 10/2008
- "An Efficient HW Architecture for Packet Re-sequencing in NP MPSoCs", DSD 2009, Patras
- "An Application-aware Load Balancing Strategy for NPs", HiPEAC 2010, Pisa
Outlook
Outlook
Lawrence G. Roberts: "A Radical New Router" ... might be a FlexPath router!
- the Path Dispatcher determines the processing path
- GP processing for unknown packets
- AutoRoute for known packet streams
Thank you!