
HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team
Tzahi Oved <tzahio@mellanox.com>, Or Gerlitz <ogerlitz@mellanox.com>
Netdev 1.1, 2016

Bonding / Team drivers

- Both expose a software netdevice that provides LAG / HA toward the networking stack.
- The team/bond is considered an upper device to the lower NIC net-devices through which packets flow to the wire.
- Different modes of operation: Active/Passive, 802.3ad (LAG); and policies: link monitoring, xmit hash, etc.
- Bonding: the legacy driver.
- Team: introduced in 3.3; a more modular, flexible and extendable design, with the state machine in a user-space library/daemon.

HW LAG using SW Team/Bond

Idea: use SW LAG on netdevices to apply LAG to HW-offloaded traffic, since offloaded traffic doesn't pass through the network stack.

100Gb/s switch:
- Each port is represented by a netdevice.
- SW LAG on a few port netdevs sets HW LAG on the physical ports (mlxsw, upstream 4.5).

40/100Gb/s NIC:
- Each port of the device is an Ethernet netdevice. RDMA traffic is offloaded from the network stack; the port netdevice serves for plain Ethernet networking and as the control path for the RDMA stack.
- SW LAG on two NIC port netdevs -> HW LAG for RDMA traffic (mlx4, upstream 4.0).
- SW LAG on the PF NIC ports -> HW LAG for the vport used by a VF (mlx4, upstream 4.5).
- For the 100Gb/s NIC (mlx5): coming soon.

Network notifiers and their usage for HW LAG

- A notification is sent to subscribed consumers in the networking stack on a change that is about to take place, or that has just happened.
- The notification contains the event type and the affected parties.
- Notifications used for LAG: pre-changeupper, changeupper.
- HW driver usage of the LAG notifications:
  - pre-changeupper: refuse certain configurations, NAK the change.
  - changeupper: create / configure the HW LAG.

Switch HW driver flow (team <-> switch driver)

ip link set dev sw1p1 master team0
- NETDEV_PRECHANGEUPPER: if the LAG type is not LACP, etc., NAK - the operation fails.
- NETDEV_CHANGEUPPER: observe that a new LAG is created for the switch; create a HW LAG and add this port to it.

ip link set dev sw1p2 master team0
- NETDEV_PRECHANGEUPPER: [...]
- NETDEV_CHANGEUPPER: observe that this LAG already exists; add this port to it.
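To make this flow concrete, here is a minimal sketch, in upstream notifier style, of how a switch or NIC driver can veto an unsupported master at NETDEV_PRECHANGEUPPER and mirror the enslavement into a HW LAG at NETDEV_CHANGEUPPER. The my_* helpers are hypothetical stand-ins for driver-specific logic (in the spirit of mlxsw/mlx4); only the notifier API itself is the real kernel interface.

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/notifier.h>

/* Hypothetical stand-ins for driver-specific logic. */
static bool my_port_dev_check(const struct net_device *dev)
{
	return true;	/* real driver: is this netdev one of our ports? */
}

static int my_hw_lag_join(struct net_device *lag_dev,
			  struct net_device *port_dev)
{
	return 0;	/* real driver: create HW LAG if needed, add port */
}

static int my_netdevice_event(struct notifier_block *nb,
			      unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netdev_notifier_changeupper_info *info = ptr;

	if (!my_port_dev_check(dev))
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_PRECHANGEUPPER:
		/* Veto: only allow enslavement under a LAG master. */
		if (info->linking && !netif_is_lag_master(info->upper_dev))
			return notifier_from_errno(-EOPNOTSUPP);
		break;
	case NETDEV_CHANGEUPPER:
		/* The change happened - mirror it into the HW LAG. */
		if (info->linking && netif_is_lag_master(info->upper_dev))
			if (my_hw_lag_join(info->upper_dev, dev))
				return NOTIFY_BAD;
		break;
	}
	return NOTIFY_DONE;
}

static struct notifier_block my_netdevice_nb = {
	.notifier_call = my_netdevice_event,
};

static int __init my_lag_init(void)
{
	return register_netdevice_notifier(&my_netdevice_nb);
}

static void __exit my_lag_exit(void)
{
	unregister_netdevice_notifier(&my_netdevice_nb);
}

module_init(my_lag_init);
module_exit(my_lag_exit);
MODULE_LICENSE("GPL");
```

A veto returned at NETDEV_PRECHANGEUPPER is what makes the ip link enslavement command above fail, while NETDEV_CHANGEUPPER is the point at which the driver actually programs its HW LAG.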

RDMA over Ethernet (RoCE) / RDMA-CM

- The upstream RDMA stack supports multiple transports: RoCE, IB, iWARP.
- RoCE - RDMA over Converged Ethernet. RoCE v2 (upstream 4.5) carries the IBTA RDMA headers over UDP and uses IPv4/IPv6 addresses set on the regular Ethernet NIC port netdev.
- RoCE applications use the RDMA-CM API for the control path and the verbs API for the data path.

RDMA-CM API (include/rdma/rdma_cm.h):
- Address resolution - local route lookup + ARP/ND services (rdma_resolve_addr())
- Route resolution - path lookup in IB networks (rdma_resolve_route())
- Connection establishment - per-transport CM to wire the offloaded connection (rdma_connect())

Verbs API:
- Send/RDMA - send a message or perform an RDMA operation (post_send())
- Poll - poll for completion of a Send/RDMA or Receive operation (poll_cq()); async completion handling and fd semantics are supported
- Post Receive Buffer - hand receive buffers to the NIC (post_recv())

RDMA device:
- The DEVICE structure exposes all of the above operations.
- It is associated with a net_device.
- Available for both RoCE and user-mode Ethernet programming (DPDK).
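As a hedged illustration of this control-path / data-path split, the sketch below strings the RDMA-CM calls together on the client side and then posts a single send, using librdmacm and libibverbs. It is an outline under simplifying assumptions: the server address 192.0.2.1:7471 is a placeholder, error handling is minimal, and the cm_id is created without an event channel so the CM calls run synchronously.

```c
/* Build: gcc roce_client_sketch.c -lrdmacm -libverbs */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct rdma_cm_id *id;
	struct rdma_conn_param conn = {};
	struct ibv_qp_init_attr qp_attr = {};
	struct addrinfo *dst;
	char buf[4096];

	/* ---- RDMA-CM control path ---- */

	/* NULL event channel => the cm_id operates synchronously. */
	if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
		return 1;

	/* Address resolution: local route lookup + ARP/ND; binds the id
	 * to the RDMA device associated with the chosen netdev/bond. */
	if (getaddrinfo("192.0.2.1", "7471", NULL, &dst) ||
	    rdma_resolve_addr(id, NULL, dst->ai_addr, 2000))
		return 1;

	/* Route resolution: path lookup (relevant on IB fabrics). */
	if (rdma_resolve_route(id, 2000))
		return 1;

	/* Create the QP on the resolved device before connecting. */
	qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 16;
	qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
	qp_attr.qp_type = IBV_QPT_RC;
	qp_attr.sq_sig_all = 1;
	if (rdma_create_qp(id, NULL, &qp_attr))
		return 1;

	/* Connection establishment: the per-transport CM wires the
	 * offloaded connection in the NIC. */
	if (rdma_connect(id, &conn))
		return 1;

	/* ---- Verbs data path ---- */

	struct ibv_mr *mr = ibv_reg_mr(id->pd, buf, sizeof(buf),
				       IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return 1;

	struct ibv_sge sge = { .addr = (uintptr_t)buf,
			       .length = sizeof(buf), .lkey = mr->lkey };
	struct ibv_send_wr wr = { .opcode = IBV_WR_SEND,
				  .sg_list = &sge, .num_sge = 1 };
	struct ibv_send_wr *bad;
	struct ibv_wc wc;

	/* post_send(): hand the message to the NIC ... */
	if (ibv_post_send(id->qp, &wr, &bad))
		return 1;
	/* ... poll_cq(): busy-poll for its completion. */
	while (ibv_poll_cq(id->send_cq, 1, &wc) == 0)
		;

	rdma_disconnect(id);
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
	return 0;
}
```

Note that no device name appears anywhere: rdma_resolve_addr() selects the RDMA device from the IP address, which is what allows the bond-level auto-configuration described in the following slides to stay transparent to the application.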

Native Model - HW Teaming Configuration

- Native Linux administration: RoCE bonding is mainly auto-configured.
- RoCE uses a transport object (QP, TIS) attribute: port affinity.
- The RDMA devices associated with eth0 and eth1 are used for port management only (through immutable caps), and will unregister and re-register to drop existing consumers.
- A new ib_dev is registered and attached to the bond.
- The driver listens to Linux bond enslavement netlink events on eth0 and eth1.
- The new RDMA device always uses the vendor's pick of function (PF0/1 or both).
- LACP (802.3ad) is either handled by the Linux bonding/teaming driver, or in HW/FW for supporting NICs (required for many-PFs-to-single-physical-port configurations).
- HW bond: NIC logic for HW forwarding of ingress traffic to the bond/team RDMA device; net_dev traffic is passed directly to the owner net_dev according to the ingress port.

[Diagram: Linux bonding/teaming over eth0 (PF0, Port1) and eth1 (PF1, Port2), with a single RDMA device on top of the HW bond inside the NIC.]

eswitch Software Model - Option I

[Diagram: VMs (VM0-VM3) attached through VFs (VF0.0, VF0.1) to the NIC eswitch on PF0; in the native OS, a Linux/OVS bridge (br0) connects eth0 with the VF representors rep_vf0 and rep_vf1; an RDMA device and a Linux switch device sit on top of PF0 and the NIC port.]

eswitch Software Model - Option II

[Diagram: as in Option I, but the Linux/OVS bridge connects representor netdevs - rep_eth0 / rep_phy0 for the host and physical port and rep_vf0 / rep_vf1 for the VFs - alongside eth0; RDMA device and Linux switch device on PF0, eswitch with VF0.0 / VF0.1 in the NIC.]

eswitch Software Model with HA

[Diagram: Linux bonding over the PF port netdevs with an RDMA device on top; the Linux/OVS bridge connects rep_eth0, rep_phy0, rep_phy1 and the VF representors rep_vf0 / rep_vf1; the NIC eswitch implements a HW bond across PF0/Port1 and PF1/Port2, with VF0.0 and VF1.0 attached to VM0 and VM1.]

eswitch Software Model with Tunneling

[Diagram: an OVS VXLAN bridge (vxlan net_device with its VNI/key, over the UDP/IP stack) on top of the Linux/OVS bridge with representor netdevs (rep_eth0, rep_phy0, rep_vf0, rep_vf1); the NIC eswitch on PF0/eth0 provides a HW tunnel for VF0.0 and VF0.1 traffic toward the port.]

Multi-PCI Socket NIC

- Multiple-endpoint NIC: the NIC can be connected through one or more PCI buses, each bus connected to a different NUMA node.
- Exposed as two or more net_devices, each with its own associated RDMA device, so each enjoys direct device access to its local NUMA node.
- Application use and feel: the application would like to work with a single net interface.
- Use Linux bonding, together with RDMA device bonding:
  - For TCP/IP traffic, on TX select the slave according to the calling context's affinity.
  - For RDMA traffic, select the slave according to the transport object's (QP) logical port affinity, or the transport object creation thread's CPU affinity (see the sketch below).
- Don't share HW resources (CQ, SRQ) across CPU sockets; each device has its own HW resources.

[Diagram: two CPUs connected by QPI, each with a local PCI link to the NIC (eth0/PF0 and eth1/PF1); Linux bonding/teaming and a single RDMA device sit on top of the HW bond toward the port.]
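As a hedged, user-space illustration of the "creation thread CPU affinity" selection policy, the sketch below picks the RDMA device whose PCI function is local to the calling thread's NUMA node by comparing getcpu() against the device's sysfs numa_node. The function name open_local_device() and the fallback to the first device are assumptions for illustration; in the model above this selection would be done by the bonding-aware RDMA device itself rather than by the application.

```c
/* Open the RDMA device local to the calling thread's NUMA node.
 * Sketch only. Build: gcc -c numa_pick_sketch.c (link with -libverbs)
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <infiniband/verbs.h>

/* NUMA node of the PCI function backing an RDMA device (from sysfs). */
static int device_numa_node(struct ibv_device *dev)
{
	char path[256];
	int node = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/infiniband/%s/device/numa_node",
		 ibv_get_device_name(dev));
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &node) != 1)
		node = -1;
	fclose(f);
	return node;
}

/* Open the NUMA-local device if there is one, else the first device. */
struct ibv_context *open_local_device(void)
{
	unsigned int cpu, node;
	struct ibv_device **list;
	struct ibv_context *ctx = NULL;
	int num, i;

	/* getcpu() reports the CPU and NUMA node this thread runs on. */
	if (syscall(SYS_getcpu, &cpu, &node, NULL))
		return NULL;

	list = ibv_get_device_list(&num);
	if (!list)
		return NULL;

	if (num > 0) {
		for (i = 0; i < num; i++)
			if (device_numa_node(list[i]) == (int)node)
				break;
		if (i == num)
			i = 0;			/* fallback: first device */
		ctx = ibv_open_device(list[i]);	/* open before freeing list */
	}
	ibv_free_device_list(list);
	return ctx;
}
```

A thread creating its transport objects (QP, CQ) on the context returned here keeps both its HW resources and its DMA traffic on the local socket, which is exactly the "don't share HW resources across CPU sockets" guideline above.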