Lagopus: High-performance Software OpenFlow Switch for Wide-area Networks
"A bird in cloud"
Yoshihiro Nakajima, NTT Labs, November 2014
Agenda
- Trends
- Motivation and target of vswitch
- Packet processing overview
- Challenges with Intel x86 servers
- Technologies for speed
- Design
- Implementation
- Evaluation
- Future development plan
- Open source activity
Our mission: innovate networking with open source software
Trend shift in networking
- Closed (vendor lock-in)            ->  Open (lock-in free)
- Yearly dev cycle                   ->  Monthly dev cycle
- Waterfall dev                      ->  Agile dev
- Standardized protocol              ->  De facto standard API
- Special-purpose HW / appliance     ->  Commodity HW / server
- Distributed control                ->  Logically centralized control
- Custom ASIC / FPGA                 ->  Merchant chip
What is Software-Defined Networking?
SDN conceptual model:
- Programmability via an abstraction layer: enables flexible and rapid service/application development
- Logically centralized view: hides and abstracts the complexity of networks, provides a view of the entire network
- Decoupled control plane and data plane: frees the control plane from the box (OpenFlow is one such API)
Reference: http://opennetsummit.org/talks/ons2012/pitt-mon-ons.pdf
Innovate services and applications at software-development speed.
OpenFlow vs conventional NW node
- Conventional node: control plane (routing/switching) and data plane (ASIC, FPGA) live in one box
- OpenFlow: the control plane moves to an OpenFlow controller, which programs the switch's flow tables over the OpenFlow protocol
- Flow table entries are match / action / counter triples, chained across multiple flow tables:
  - Flexible flow match patterns: port #, VLAN ID, MAC addr, IP addr, TCP port #
  - Actions (frame processing): output port, drop, modify/pop/push header
  - Flow statistics: # of packets, byte count, flow duration, ...
Why SDN?
- Service differentiation (innovation): increase user experience; provide unique services not yet in the market
- Time to market (velocity): no dependence on vendors' feature roadmaps; develop the features you need, when you need them
- Cost efficiency: reduce OPEX with automated workflows; reduce CAPEX with COTS hardware
Evaluate the benefits of SDN by implementing our own control plane and switch.
Copyright 2014 NTT corp. All Rights Reserved.
Two bibles
Lessons learned from existing vswitches
- Not enough performance
  - Packet processing speed < 1Gbps
  - Adding 10K flow entries takes > two hours; flow-management processing is too heavy
- Difficult to develop and extend
  - Too many abstraction layers cause confusion: interface abstraction, switch abstraction, protocol abstraction, packet abstraction, ...
  - Too complex: multiple packet-processing code paths exist for the same processing (userspace, kernel space)
  - Invisible configuration: hidden flow rules make debugging flow designs chaotic
Motivation for our vswitch
- Agile and flexible networking
  - Full automation of provisioning, operation, and management
  - Seamless networking for customers
- Server virtualization and NFV need a high-performance software switch
  - Short latency; wire rate even for short packets (64B)
- No high-performance OpenFlow 1.3 software switch existed for wide-area networks
  - 1M flow rules, 10Gbps wire rate, multiple-table support
vswitch requirements from the user side
1. Run on commodity PC servers and NICs
2. Provide a gateway function to interconnect various network domains
   - Support the packet/frame types of DC, IP-VPN, MPLS, and access networks
3. Achieve 10Gbps wire rate with >= 1M flow rules
   - Low-latency packet processing
   - Flexible flow lookup using multiple tables
   - High-performance flow rule setup/delete
4. Run in userland with little dependency on the OS kernel
   - Easy software upgrade and deployment
5. Support various management and configuration protocols
Why Intel x86 servers for networking?
- Strong processing power with many CPU cores: 8 cores/CPU (2012), 12 cores/CPU (2013), ...; rapid development cycle
- Available everywhere: can be purchased in Akihabara, versus a 3-month lead time for special-purpose HW
- Reasonable price and flexible configuration: an 8-core (Atom) CPU for 1,000 USD; choose a configuration to match your budget
- Many programmers and engineers: porting to an NPU needs special skills and knowledge
Typical packet processing
1. Packet RX: processing in the NIC driver
2. Frame parsing: parse the packet for later processing
3. Flow lookup (match/action): lookup, header rewrite
4. QoS / queueing: policer, shaper, marking
5. Packet TX
Typical switch architecture
- Switch agent: handles dynamic routing (BGP, OSPF), OpenFlow, the CLI, and other control/management systems; controls the data plane
- Data plane: packet processing and forwarding against the flow table; keeps statistics
- Between them flow data-plane control messages, events (link up/down), and statistics
OpenFlow (recap)
- An OpenFlow controller programs the switch's flow tables (match / action / counter entries, chained across multiple tables) over the OpenFlow protocol
- Flexible match patterns (port #, VLAN ID, MAC addr, IP addr, TCP port #), actions (output, drop, modify/pop/push header), and per-flow statistics
Number of packets to be processed for 10Gbps
- Short packet (64B): 14.88 Mpps, 67.2 ns per packet
- "Computer" packet (1KB): 1.2 Mpps, 835 ns per packet
(figure: packets per second vs packet size, 0 to 1280 bytes)
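The 14.88 Mpps / 67.2 ns figures follow directly from Ethernet framing overhead. A quick sketch of the arithmetic (assuming the standard 20 bytes per frame of preamble, start-of-frame delimiter, and inter-frame gap):

```python
def pps_10gbe(frame_bytes):
    """Max packets/s on a 10GbE link for a given frame size.

    Every frame carries 20 extra bytes on the wire:
    7B preamble + 1B start delimiter + 12B inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    return 10_000_000_000 / wire_bits

def ns_per_packet(frame_bytes):
    """Per-packet time budget at line rate."""
    return 1e9 / pps_10gbe(frame_bytes)

print(round(pps_10gbe(64)))          # 14880952 -> the 14.88 Mpps on the slide
print(round(ns_per_packet(64), 1))   # 67.2 ns
print(round(ns_per_packet(1024), 1)) # 835.2 ns
```

At 2GHz, 67.2 ns is only about 134 clock cycles per 64-byte packet, which is the budget the later slides work against.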
Intel x86 server configuration
- Two CPU sockets connected by QPI, each with its own local memory; NICs attach via PCI-Express to each socket (reference: Supermicro X9DAi)
Recent CPU details
- A CPU socket contains multiple CPU cores: 6 to 18 cores in Intel Haswell
- The CPU has a hierarchical cache system: each core has private L1 and L2 caches; all cores share the L3 cache; cores are connected by multiple ring networks
- Multiple memory controllers: 4 on Intel Xeon E5-2600, 2 on desktop CPUs
(See the Intel Haswell documentation.)
CPU cache and memory
- CPU caches accelerate access by exploiting data locality: data processing within cache is very fast, while access to main memory takes a long time; the prefetcher helps for sequential access patterns

  Level    Size (Byte)   Bandwidth (GB/s)   Latency (random)
  L1 $     32K           699                4 clocks
  L2 $     256K          228                12 clocks
  L3 $     6M            112                44 clocks
  Memory   GB class      20                 300 cycles (91 ns)
Number of CPU cycles for typical packet processing (2GHz CPU)
(figure: required CPU cycles, 0 to 6000, for L2 forwarding, IP routing, L2-L4 classification, TCP termination, simple OpenFlow processing, and DPI, against the 10Gbps and 1Gbps per-packet budgets)
Number of packets to be processed for 10Gbps with one CPU core
- Short packet (64B): 14.88 Mpps, 67.2 ns; a budget of 134 clocks at 2GHz, 201 clocks at 3GHz
- "Computer" packet (1KB): 1.2 Mpps, 835 ns; a budget of 1670 clocks at 2GHz, 2505 clocks at 3GHz
(figure: packets per second vs packet size)
Standard packet processing on Linux/PC
- Userspace vswitch (event-based): 1. interrupt and DMA into a kernel skb_buf; 2. read() system call copies the packet up through the socket API to the vswitch data plane; 3. write() system call copies it back down; 4. DMA to the NIC
- Kernel-space vswitch (event-based): 1. interrupt and DMA into an skb_buf, the data plane runs inside the kernel; 2. DMA to the NIC
- Pain points: interrupt-driven RX, context switches between user and kernel space, and many memory copies/reads along the path
Performance issues
1. Massive RX interrupt handling for the NIC device
2. Heavy task-switch overhead
3. PCI-Express I/O and memory bandwidth are low compared with CPU speed
4. Shared data access is a bottleneck between threads
5. Huge numbers of cache misses/updates in the data translation lookaside buffer (TLB) with 4KB pages
Technologies for speed
- Leverage multi-core CPUs: pipeline processing, parallel processing, multi-threaded programming
- Leverage high-performance I/O libraries: Intel DPDK
- Leverage lock-free memory techniques: lock-less queues, RCU (read-copy-update)
- Design high-performance flow lookup
Basic techniques for performance (1/2)
- Pipelining: divide the whole processing into stages (RX, lookup, action, QoS, TX), one per processing element; each packet gets additional processing time (x # of stages)
- Parallelization: run the whole RX-lookup-action-QoS-TX chain on multiple processing units in parallel; increases PPS throughput (x # of processing units)
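The pipeline layout above can be sketched as a chain of stage workers connected by queues. This is a toy illustration, not Lagopus code: `queue.Queue` stands in for the lock-less rings a real data plane would use, and the stage bodies just tag the packet instead of doing real work.

```python
import queue
import threading

STAGES = ["rx", "lookup", "action", "qos", "tx"]

def run_pipeline(packets):
    """Pipeline layout: one worker thread per stage, connected by queues.
    Every packet traverses all stages in order."""
    qs = [queue.Queue() for _ in range(len(STAGES) + 1)]

    def stage(i):
        while True:
            pkt = qs[i].get()
            if pkt is None:            # shutdown marker: pass it on and exit
                qs[i + 1].put(None)
                return
            pkt.append(STAGES[i])      # record which stage handled the packet
            qs[i + 1].put(pkt)

    workers = [threading.Thread(target=stage, args=(i,)) for i in range(len(STAGES))]
    for w in workers:
        w.start()
    for p in packets:
        qs[0].put(p)
    qs[0].put(None)

    done = []
    while (pkt := qs[-1].get()) is not None:
        done.append(pkt)
    for w in workers:
        w.join()
    return done
```

The parallel layout from the same slide would instead run all five stages on each of N cores, multiplying packets per second rather than per-packet processing time.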
Basic techniques for performance (2/2): packet batching
- Frequent locks on the flow lookup table are unavoidable: the OpenFlow agent modifies it, and counters are retrieved for SNMP
- Naive implementation: a lock/unlock pair is required for every input packet
- Packet batching: one lock/unlock pair covers a whole batch of input packets, so the number of lock operations drops sharply
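The lock-reduction effect of batching is easy to demonstrate by counting lock acquisitions. A minimal sketch (class and function names are illustrative, not from Lagopus):

```python
import threading

class FlowTable:
    """Lookup table shared with the OpenFlow agent, so access is locked."""
    def __init__(self):
        self.lock = threading.Lock()
        self.lock_acquisitions = 0
        self.rules = {}

    def lookup_batch(self, packets):
        """One lock acquisition covers the whole batch."""
        with self.lock:
            self.lock_acquisitions += 1
            return [self.rules.get(p, "drop") for p in packets]

def process_per_packet(table, packets):
    """Naive path: N packets -> N lock acquisitions."""
    return [table.lookup_batch([p])[0] for p in packets]

def process_batched(table, packets, batch=32):
    """Batched path: N packets -> ceil(N / batch) lock acquisitions."""
    out = []
    for i in range(0, len(packets), batch):
        out.extend(table.lookup_batch(packets[i:i + batch]))
    return out
```

For 100 packets with a batch size of 32 this is 4 lock acquisitions instead of 100.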
Implementation techniques for speed
- Bypass in packet processing: direct data copy from the NIC buffer into the CPU cache (Intel Data Direct I/O); bypass the kernel network stack
- Fast flow lookup: a flow-cache mechanism, reduced memory access exploiting locality, high-speed flow lookup algorithms
- Utilize thread-local storage
- Explicit assignment of processing threads to CPUs, aware of hardware topology (CPU, bus, and NIC)
Intel Data Plane Development Kit (DPDK)
- x86-architecture-optimized data-plane library and NIC drivers
  - Memory-structure-aware queue and buffer management
  - Packet flow classification
  - Polling-mode NIC drivers
- Low-overhead, high-speed runtime optimized for data-plane processing
- Abstraction layer for heterogeneous server environments
- BSD license
Intel DPDK accelerates performance
DPDK apps
- Conventional userspace vswitch (event-based): interrupts, DMA through kernel skb_buf, and read()/write() system calls on every packet
- DPDK app (polling-based): the DPDK data plane in userspace accesses the NIC packet buffers directly; 1. DMA write, 2. DMA read, with no interrupts or system calls in the packet path
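The heart of the polling-based model is a busy loop that pulls bursts of packets from the NIC ring without ever blocking. The sketch below is modeled loosely on DPDK's `rte_eth_rx_burst`; the Python list standing in for the RX ring and the fixed spin count are assumptions for the sake of a self-contained example.

```python
def rx_burst(ring, max_burst=32):
    """Polling-mode receive: grab up to max_burst packets without blocking.
    An empty result just means 'poll again'; there is no interrupt and
    no context switch."""
    burst = ring[:max_burst]
    del ring[:max_burst]
    return burst

def poll_loop(ring, spins=10):
    """Busy-poll the RX ring a fixed number of times. A real DPDK lcore
    loop spins forever, pinned to one physical CPU core."""
    received = []
    for _ in range(spins):
        received.extend(rx_burst(ring))
    return received
```

The burst size (32 here) is also what makes the packet-batching technique from the earlier slide natural: each poll hands the rest of the pipeline a batch, not a single packet.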
Target of Lagopus vswitch
- High-performance software-based OpenFlow switch
  - 10Gbps wire-rate packet processing per port
  - 1M flow entries
- Expand SDN apps to wide-area networks, not only data centers: WAN protocols, e.g. MPLS and PBB
- Various management/configuration interfaces
Approach for vswitch development
- Simple is better for everything: packet processing, protocol handling
- Straightforward approach: no OpenFlow 1.0 support; built from scratch (no reuse of existing vswitch code)
- Userland packet processing as much as possible; keep kernel code small (kernel module updates are hard to operate)
- Every component and algorithm can be replaced: flow lookup, libraries, protocol handling
vswitch design
- Simple modular design: switch control agent + data plane
- Switch control agent: unified configuration data store, multiple management interfaces, data-plane control through a HAL
- Data plane: high-performance I/O library (Intel DPDK); raw sockets for testing
Implementation strategies for the vswitch
1. Massive RX interrupt handling for the NIC device => polling-based packet receiving
2. Heavy task-switch overhead => explicit thread assignment, no kernel scheduler (one thread per physical CPU)
3. PCI-Express I/O and memory bandwidth low compared with the CPU => reduce the number of I/O and memory accesses
4. Shared data access bottleneck between threads => lock-less queues, RCU, batch processing
5. Huge numbers of data-TLB misses/updates with 4KB pages => huge pages for the DTLB (2MB to 1GB)
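Item 5 is easy to quantify: the point of huge pages is that far fewer TLB entries are needed to map the same packet-buffer pool, so the working set fits in the TLB. The numbers below are simple arithmetic, not measurements:

```python
def tlb_entries_needed(buffer_bytes, page_bytes):
    """How many TLB entries it takes to cover a buffer of a given size."""
    return -(-buffer_bytes // page_bytes)   # ceiling division

GB = 1 << 30

# Mapping a 1 GB packet-buffer pool:
print(tlb_entries_needed(GB, 4 * 1024))      # 262144 entries with 4 KB pages
print(tlb_entries_needed(GB, 2 * 1024**2))   # 512 entries with 2 MB hugepages
print(tlb_entries_needed(GB, GB))            # 1 entry with a 1 GB hugepage
```

With 4KB pages, even a large L2 TLB covers only a few megabytes of buffer, so nearly every packet access can miss; with 2MB or 1GB hugepages the whole pool stays mapped.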
Packet processing using multi-core CPUs
- Exploit many-core CPUs; reduce data copies and moves (pass references)
- A simple packet classifier in I/O RX distributes packets to worker cores for parallel processing
- Decouple I/O processing from flow processing; improves D-cache efficiency
- Explicit thread-to-core assignment, e.g.: NIC RX buffers feed I/O RX on CPU0-1, flow lookup and packet processing run on CPU2-5, I/O TX on CPU6-7 drains to the NIC TX buffers, with ring buffers between each stage
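The ring buffers between the I/O cores and the worker cores are single-producer/single-consumer rings, which need no lock because each index is written by exactly one side (Lagopus relies on DPDK's ring library for this; the class below is an illustrative Python sketch, not real lock-free code, since CPython's memory model differs from C's).

```python
class SpscRing:
    """Single-producer/single-consumer ring buffer sketch.

    head is written only by the producer, tail only by the consumer,
    so with one thread on each side no lock is required."""
    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.buf = [None] * size
        self.mask = size - 1
        self.head = 0   # producer-owned index
        self.tail = 0   # consumer-owned index

    def enqueue(self, pkt):
        if self.head - self.tail == len(self.buf):
            return False                        # ring full: caller retries
        self.buf[self.head & self.mask] = pkt
        self.head += 1
        return True

    def dequeue(self):
        if self.tail == self.head:
            return None                         # ring empty: poll again
        pkt = self.buf[self.tail & self.mask]
        self.tail += 1
        return pkt
```

The power-of-two size lets index wrap-around be a cheap bitwise AND instead of a modulo, the same trick DPDK's rte_ring uses.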
Packet batching (recap)
- One lock/unlock pair per batch instead of per packet reduces locking on the flow lookup table, which the OpenFlow agent and SNMP counter retrieval also contend for
Bypassing the pipeline with a flow cache
- Reduce the number of flow lookups across multiple tables; the flow cache allows the multi-table lookup to be bypassed
- Naive implementation: every input packet walks table1 -> table2 -> table3 before output
- Lagopus implementation: 1. the first packet of a new flow walks all tables and the result is written into the flow cache; 2. subsequent packets of the same flow hit the cache and skip the tables entirely
- Multiple flow-cache generation and management
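The fast-path/slow-path split can be sketched in a few lines. The table contents, keys, and action strings here are made-up placeholders; the point is only the control flow of the cache:

```python
def full_lookup(pipeline, pkt_key):
    """Slow path: walk every flow table in order."""
    actions = []
    for table in pipeline:
        actions.append(table.get(pkt_key, "drop"))
    return actions

def process(pipeline, cache, pkt_key, stats):
    """Flow-cache fast path: the first packet of a flow walks all tables
    and writes the combined result into the cache; later packets of the
    same flow bypass the multi-table lookup entirely."""
    if pkt_key in cache:
        stats["hit"] += 1
        return cache[pkt_key]
    stats["miss"] += 1
    actions = full_lookup(pipeline, pkt_key)
    cache[pkt_key] = actions
    return actions
```

The trade-off is invalidation: whenever the agent modifies a flow table, cached results derived from it must be flushed, which is why the slide mentions flow-cache generation management.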
OpenFlow 1.3 conformance status
- Conformance test results by Ryu Certification: http://osrg.github.io/ryu/certification.html
Performance evaluation
- Summary
  - Throughput: 10Gbps wire rate
  - Flow entries: 1M flow entries supported
  - >= 4000 flow-rule manipulations per second
- Evaluation models
  - WAN-DC gateway: MPLS-VLAN mapping
  - L2 switch: MAC-address switching
Evaluation scenario
- Use case: Cloud-VPN gateway (from NTT Com's Ito-san presentation at ONS 2014)
- "Tomorrow": automatic connection setup via northbound APIs
- The SDN controller maintains the mapping between tenant logical networks and VPNs
- Routes are advertised via eBGP; no need to configure ASBRs on the provider side
- Topology: SDN controller (API) drives the data center network via OpenFlow, and connects to the MPLS-VPN via eBGP and MPLS (inter-AS option B)
Performance evaluation setup
- A tester sends traffic (varying packet size and number of flows) to Lagopus (flow cache on/off, flow table loaded with flow rules) and measures throughput (bps/pps/%)
- Server spec:
  - CPU: dual Intel Xeon E5-2660 (8 cores / 16 threads, 20MB cache, 2.2GHz, 8.00GT/s QPI, Sandy Bridge)
  - Memory: DDR3-1600 ECC 64GB, quad-channel, 8x8GB
  - Chipset: Intel C602
  - NIC: Intel Ethernet Converged Network Adapter X520-DA2 (Intel 82599ES, PCIe v2.0)
WAN-DC gateway: throughput vs packet size, 1 flow, no flow cache
(figure: throughput in Gbps, 0 to 10, vs packet size, 0 to 1600 bytes, for 10, 100, 1k, 10k, 100k, and 1M flow rules)
WAN-DC gateway: throughput vs packet size, 1 flow, flow cache enabled
(figure: throughput in Gbps, 0 to 10, vs packet size, 0 to 1600 bytes, for 10, 100, 1k, 10k, 100k, and 1M flow rules)
WAN-DC gateway: throughput vs number of flows, 1518-byte packets
(figure: throughput in Gbps, 0 to 10, vs 1 to 1M flows, for 10k, 100k, and 1M flow rules)
L2 switch performance, 10GbE x 2 (RFC 2889 test)
(figure: throughput in Mbps, 0 to 10000, vs packet sizes 72 to 1518 bytes, for LINC, OVS (netdev), OVS (kernel), and Lagopus; Intel Xeon E5-2660 (8 cores / 16 threads), Intel X520-DA2, DDR3-1600 64GB)
Y2014 development plan
- Version 0.1 (2014 Q2): pure OpenFlow switch, Intel DPDK 1.6.0, CLI
- Version 0.3 (2014 Q4): hypervisor integration (virtual NIC with KVM), KVM and OpenStack for NFV
- Version 0.7 (2015 Q1): 40Gbps wire-rate high-performance switch, legacy protocols (bridge, L3), new configuration system (OVSDB, OF-CONFIG, new CLI)
- Version 1.0 (2015 Q2): tunnel handling (VxLAN, GRE) for overlay networks over L3, SNMP
- Hardware implementation: bare-metal switch with merchant silicon, with Advantech
One more thing: location-aware bandwidth-controlled WiFi with Ryu and Lagopus @ SDN Japan 2014
Concept
- The good audience gets Internet connectivity with guaranteed bandwidth: front-area seats get a good connection, back-area seats a poor one
Location-aware bandwidth-controlled WiFi @ SDN Japan 2014
Location-aware bandwidth-controlled WiFi
- Venue zones from the front (screen) to the back: 50Mbps, 30Mbps, 20Mbps
- The good audience can connect to the Internet with guaranteed bandwidth
Result
- Many audience members in the back-area seats used mobile WiFi (smartphone tethering) to connect to the Internet instead
- Next step: we need 3G or LTE mobile radio jamming hardware to shut down ce
Network topology
(figure: rtx1200 NAT/DHCP server on an untagged port; wlc WiFi controller; trunk to a cat3750x carrying VLAN IDs 101, 102, and 103)
Lagopus: flow table design
- Switch ports connect to the rtx1200, the wlc, and the cat3750x trunk
- Table 0 (MAC learning): match src MAC; action: pop VLAN, goto table 1; default: packet-in
- Table 1 (flooding or forwarding): match dst MAC; action: push VLAN, meter, output; default: flood via a group of flooding actions (packets from each port flood to the others)
- Meters: 50Mbps / 30Mbps / 20Mbps per zone
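The two-table design above behaves like a classic learning switch. A heavily simplified sketch, with VLAN push/pop, metering, and the controller's packet-in/flow-registration round trip all omitted and the function name made up for illustration:

```python
def switch_packet(src_mac, dst_mac, in_port, mac_table):
    """Two-table learning-switch sketch.

    Table 0 role: learn src MAC -> in_port (the real table 0 instead
    packet-ins to the controller, which installs the flow entry).
    Table 1 role: forward on dst MAC, or flood on a miss (the real
    table 1 also pushes the zone VLAN and applies the meter)."""
    mac_table[src_mac] = in_port        # table 0: MAC learning
    out_port = mac_table.get(dst_mac)   # table 1: forwarding lookup
    return out_port if out_port is not None else "flood"
```

The first frame from an unknown destination floods (the group table's ALL group); once the destination host has sent anything, its port is known and traffic is forwarded directly.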
Lagopus: group table design
- OpenFlow group types: ALL, SELECT, INDIRECT, FF; the ALL type is used for flooding, since every bucket's action set is executed
- Flooding buckets push the zone VLAN (101/102/103) and output to the corresponding port (1/2/3), skipping the ingress port
Packet processing flow (1/2)
- A frame (VLAN 101, src aa..aa, dst ff..ff, from port 1) misses table 0 and triggers a packet-in; the controller performs flow entry registration and a packet-out
- Table 0: match src MAC; action: pop VLAN, goto table 1; default: packet-in
- Table 1: match dst MAC; action: push VLAN, meter, output; default: flood via the group; meters: 50/30/20Mbps
Packet processing flow (2/2)
- After flow entry registration:
  - Table 0: if src MAC is aa..aa, then pop VLAN, goto table 1; default: packet-in
  - Table 1: if dst MAC is aa..aa, then push VLAN 101, meter 50Mbps, output 1; default: flood via the group; meters: 50/30/20Mbps
Enjoy hacking in networking