Multi-Path Elad Raz Mellanox Technologies
Agenda Motivation Introducing Multi-Path Design Status and initial results Next steps Conclusions March 15 18, 2015 #OFADevWorkshop 2
MP- Motivation (1) Failovers and High Availability Support Bandwidth Aggregation L3 datacenter support TOR A TOR B X 10.2.0.0 /16 20.1.0.0 /16 Dev A Dev B Dev A Dev B Host A Host B March 15 18, 2015 #OFADevWorkshop 3
MP- Motivation (2) Transparent migration VM MP- SoftRoCE IB VF driver Eth paravirt Hypervisor Hypervisor IB HCA NIC NIC March 15 18, 2015 #OFADevWorkshop 4
What is Multi-Path? Application HAL and services MP- Verbs Application HAL and services Verbs MP- ib_dev1 roce_dev2 ib_dev1 roce_dev2 Session Level Connection HW 1 HW 2 Underlying Verbs Session Level Connection HW 1 HW 2 Router A Router B March 15 18, 2015 #OFADevWorkshop 5
MP- Operation App rdmacm mp-dev phys-dev1 phys-dev2 Connect Open (virt) resources CM protocol (primary flow) + MP-capability exchange Open (phys) resources Data path (virt) operations Get MP info Data path (phys) operations Traffic Connect Open (phys) resources CM protocol (sub-flow) Data path (virt) operations Data path (phys) operations Traffic March 15 18, 2015 #OFADevWorkshop 6
Connection Migration Virtual consumer Physical producer Send Queue virtual producer Virtual consumer Receive Queue Physical producer virtual producer March 15 18, 2015 #OFADevWorkshop 7
MP- Comparison Automatic Path Migration RoCE-LAG MP- Multi-Port failover Bandwidth aggregation Application agnostic L3 session Multi-device failover Migration support March 15 18, 2015 #OFADevWorkshop 8
MP- and MP-TCP MP-TCP MP- MP Messages TCP options CMA MADs (private data) Address management Add/remove addresses Add/remove CMA address Flow management Add/remove TCP sub-flows Establish/migrate/teardown QPs Communication endpoint TCP socket MP device Data sequencing Sub-flow address combinations Byte-stream divided between sub-flows Flow + session based sequencing Any IP interface to any peer IP interface, subject to middleboxes (e.g., firewalls, NAT) QP and actual HW WQEs (performance) Any addressing to any peer addressing, subject to the same Technology (IB, RoCE, iwarp) March 15 18, 2015 #OFADevWorkshop 9
MP- Design User/kernel mp-rdma driver Device instance hosting MPcapable resources Implements resource virtualization and connection failover Uses underlying physical devices transparently CM MP mgr ib_dev1 CMA MP mgr Application Verbs mp_dev Kernel ULP ib_dev2 CM/CMA support MP capability negotiation cma_dev1 cma_dev2 cm_dev2 cm_dev2 ib_core mp_dev ib_dev1 ib_dev2 HW1 HW2 March 15 18, 2015 #OFADevWorkshop 10
Policies Active backup Dev 0 QP1 QP2 Dev 1 QP1 QP2 QP3 QP3 Load Balancing Dev 0 QP40 QP1 Dev 1 QP8 QP93 Efficiency QP82 SRQ13 QP61 Dev 0 MR16 QP4 CQ42 QP1 QP2 QP74 March 15 18, 2015 #OFADevWorkshop 11
Virtual Namespaces Application 1 MP- (1) vpd1 vmr1 vcq1 vqp1 vqp2 vqp3 vcq2 vsrq1 vqp4 vcq3 QP82 SRQ13 QP61 Dev 0 MR16 CQ42 MR13 QP33 QP98 Dev 1 CQ22 CQ23 March 15 18, 2015 #OFADevWorkshop 12
Resource Creation Virtual resources vqp1 Policy selection Dev 0 : port 1 PD12 Resource creation QP42 Post-send (MR) MR 2 CQ2 March 15 18, 2015 #OFADevWorkshop 13
Data Path Translate MRs: ibv_post_send ibv_post_recv Application 1 MP- (1) Translate QPs: ibv_poll_cq vqp1 vcq1 vpd1 vwq (tx) vwq (rx) Monitor WQs: PSN Completed QP1 CQ1 Dev 0 March 15 18, 2015 #OFADevWorkshop 14
Status and Initial Results User-space driver progressing nicely Resource management Connection management Failover Data path for RC send-receive operations Encouraging initial results Latency vs. Message size (usec) Bandwidth vs. Message Size 9 8 7 Lower is better 50000 45000 40000 Higher is better 6 35000 5 30000 4 25000 3 20000 2 15000 1 10000 0 5000 0 0 10000 20000 30000 40000 50000 60000 70000 Native MP- Native card MP- March 15 18, 2015 #OFADevWorkshop 15
Next Steps Kernel MP- driver and connection model Dynamic device removal notifications and Atomic support Datagram and Multicast support Consider future standardization IBTA CM extensions RFC Open-source the code March 15 18, 2015 #OFADevWorkshop 16
Conclusions MP- solves multiple requirements Multi-devices failovers Transparent BW aggregation Transparent migration Multi-homed hosts Modeled over MP-TCP Promising initial results March 15 18, 2015 #OFADevWorkshop 17
Thank You #OFADevWorkshop