Brain-Slug: a BGP-Only SDN for Large-Scale Data-Centers
Adel Abouchaev, Tim LaBerge, Petr Lapukhov, Edet Nkposong
Microsoft Global Foundation Services
Presentation Outline
- Problem Overview
- SDN Controller Design
- Handling Failure Scenarios
- Feature Roadmap
- Conclusions
Problem Overview
BGP-Only Data Centers
[Figure: Clos fabric with spine AS 65501, aggregation AS 64901/64902, and ToR ASNs in the 64XXX range]
BGP is the better IGP!
- Clos topology for the network fabric
- 100K bare-metal servers and more
- IETF draft available (see References)
- Layer 3 only (no VLANs spanning the fabric)
- Simple oblivious routing
- Large fan-outs for big data centers (32+)
- Simpler than an IGP, fewer issues
- Converges fast enough if tuned
- A single protocol drives simplicity
- Operated for more than 2 years now
SDN Use Cases for the Data Center
- SDN is usually defined as centralized packet forwarding control
- But web-scale data centers are in fact simple: a lot of advanced logic is pushed to the servers already
- E.g. virtual overlays built in software (if any), traffic filtering, stats collection and analysis, load-balancing and mobility
- The network itself is simplified as much as possible
- Why change routing behavior at all? Remember, the routing logic is plain ECMP
- Still, forwarding behavior needs to change sometimes
SDN Use Cases for the Data Center (cont.)
Load-balancing in the network, or resiliency across the WAN:
- Unequal-cost load distribution, e.g. to compensate for various link failures and re-balance traffic
- More generic traffic engineering for arbitrary topologies
Moving traffic on/off of links/devices:
- Without any sort of CLI intervention, expect scripts, etc.
- Strive for zero packet loss, graceful shutdown
- Multiple uses for this simple operation: graceful reload, automated maintenance, and automated isolation of network issues in black-box scenarios
Goals and Non-Goals of the Project
Goals:
- Deploy on existing networks, without software upgrades
- Low-risk deployment with an easy rollback story
- Leverage existing protocols/functionality
- Override some routing behavior, but keep non-SDN paths where possible
Non-goals:
- Network virtualization or segmentation, etc.
- Per-flow, highly granular traffic control
- Support for non-routed traffic (e.g. L2 VPN)
- Optimizing existing protocols or inventing new ones
- Full control over all aspects of network behavior (low-level)
SDN Controller Design
Network Setup
[Figure: controller in AS 65501 peering multi-hop with devices in AS 64901, AS 64902, and ToR ASNs 64XXX; only a partial peering set displayed]
New configuration added:
- Template to peer with the central controller (passive listening)
- Policy to prefer routes injected from the controller
- Policy to announce only certain routes to the controller
Peering with all devices is multi-hop.
Key requirement: path resiliency
- Clos has a very rich path set
- Network partition is highly unlikely
SDN Controller Design
[Figure: a REST API feeds the Command Center; Listener, Decision, Speaker, and State Sync threads communicate via a Shared State Database (write & notify, wakeup & read), seeded with the Network Graph as bootstrap information; BGP Speakers/Listeners hold the BGP sessions to the network devices]
- Inject Route command: Prefix + Next-Hop + Router-ID
- Receive Route message: Prefix + Router-ID
- BGP Speaker/Listener(s) could scale horizontally (no shared state)
- The controller stores the link state of the network (next slides)
- Shared state could be anything, e.g. devices to bypass/overload
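A minimal sketch of the "write & notify / wakeup & read" pattern between the listener and decision threads, assuming the shared state database is just an in-process dict guarded by a condition variable. The names (SharedState, write_and_notify, wakeup_and_read) mirror the diagram labels but are illustrative, not the actual controller API:

```python
import threading

class SharedState:
    """Toy shared state database: listener threads write and notify,
    the decision thread wakes up and reads a snapshot."""
    def __init__(self):
        self._cond = threading.Condition()
        self._routes = {}   # (prefix, router_id) -> currently heard?
        self._dirty = False

    def write_and_notify(self, prefix, router_id, alive):
        # Called from a listener thread on every Receive Route message
        with self._cond:
            self._routes[(prefix, router_id)] = alive
            self._dirty = True
            self._cond.notify_all()

    def wakeup_and_read(self):
        # Called from the decision thread; blocks until something changed
        with self._cond:
            while not self._dirty:
                self._cond.wait()
            self._dirty = False
            return dict(self._routes)   # snapshot to run decisions on
```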
BGP Speaker/Listener
Neither component needs to perform best-path selection or relay BGP updates.
BGP Listener [stateless]:
- Tells the controller of prefixes received
- Tells the controller of BGP sessions coming up/down
- Preferably using a structured envelope (JSON/XML)
BGP Speaker [stateful]:
- API to announce/withdraw a route to a peer (Inject Route: Prefix + Next-Hop + Router-ID)
- Keeps state of announced prefixes
- Note: state is not shared among speakers
The current P.O.C. uses the open-source ExaBGP; a simple C# version is being implemented.
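A sketch of the kind of helper process ExaBGP runs, in the style of the P.O.C. ExaBGP launches the script, feeds BGP events to its stdin (JSON when the json encoder is configured), and executes any announce/withdraw commands the script writes to stdout. Exact message formats vary across ExaBGP versions, so treat the details below as illustrative:

```python
#!/usr/bin/env python3
"""ExaBGP helper process sketch: stateless listener + simple speaker."""
import json
import sys

def announce(prefix, next_hop):
    # ExaBGP picks up announce/withdraw commands from our stdout
    sys.stdout.write('announce route {} next-hop {}\n'.format(prefix, next_hop))
    sys.stdout.flush()

for line in sys.stdin:
    try:
        msg = json.loads(line)
    except ValueError:
        continue                      # skip any non-JSON chatter
    # Relay received prefixes and session up/down events to the
    # controller here (socket, queue, ...); this sketch just logs them.
    print(msg.get('type'), file=sys.stderr)
```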
Building Network Link State
Use a simple form of control-plane ping:
- A BGP session reflects physical link health
- Assumes a single BGP session between two devices; this can always be achieved using port-channels
- Create a /32 host route for every device
- Inject the prefix for device X into device X
- Expect to hear this prefix via all devices Y1..Yn directly connected to X
- If heard, declare the link between X and Y up
- Community tagging + policy ensure the prefix only leaks one hop from the point of injection
[Figure: the controller injects the prefix for R1 with a one-hop community and expects it back from R2; R2 relays the prefix, R3 does NOT relay it further]
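A sketch of the link-up inference this slide describes, assuming hypothetical inputs: a static adjacency map and, per device, the set of neighbors through which its one-hop keep-alive /32 was heard back. All names here are mine, not the controller's:

```python
def infer_link_state(topology, heard_from):
    """topology: device -> set of physically adjacent devices.
    heard_from: device X -> set of neighbors Y that re-advertised
    X's one-hop keep-alive /32 back to the controller.
    Returns the set of links declared up."""
    up_links = set()
    for x, neighbors in topology.items():
        for y in neighbors:
            # Link X-Y is up iff X's keep-alive prefix was heard via Y
            if y in heard_from.get(x, set()):
                up_links.add(frozenset((x, y)))
    return up_links
```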
Overriding Routing Decisions
- The controller knows all edge subnets and devices (e.g. ToRs)
- It runs SPF and computes next-hops for every edge prefix at every device (BFS works in most cases, O(m))
- It checks whether this differs from the baseline network graph decisions
- Only the deltas are pushed; the delta is zero if there is no difference from baseline routing behavior
- Prefixes are pushed with third-party next-hops (next slide)
- The controller may declare a link/device down to re-route traffic; this implements the overload functionality
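A minimal sketch of the per-prefix BFS and the delta computation, under the assumption that the link-state graph maps every device to its live neighbors; function names are illustrative:

```python
from collections import deque

def bfs_next_hops(graph, dst):
    """graph: device -> iterable of neighbors (links declared up).
    Returns device -> set of ECMP next-hops toward dst, from a single
    BFS rooted at the destination (O(m), as on the slide)."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Any neighbor one hop closer to dst is a valid ECMP next-hop
    return {v: {w for w in graph[v] if dist.get(w) == dist[v] - 1}
            for v in dist if v != dst}

def deltas(baseline, current):
    """Push only the devices whose next-hop sets changed."""
    return {d: nh for d, nh in current.items() if baseline.get(d) != nh}
```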
Overriding Routing Decisions (cont.)
- Injected routes carry third-party next-hops, which need to be resolved via BGP
- The next-hops therefore have to be injected as well: a next-hop /32 is created for every device, using the same one-hop BGP community
- The same keep-alive prefix could be used as the next-hop
- Only one path is allowed per BGP session, so we need either Add-Path or multiple peering sessions
- Worst case: # sessions = ECMP fan-out
- Add-Path Receive-Only would help, but the model works in either case
[Figure: the controller injects prefix X/24 with next-hops N1, N2 into R1, and injects next-hop prefixes N1/32 and N2/32 via R2 and R3]
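A small sketch of the "one path per session" constraint: without Add-Path, announcing k ECMP next-hops for the same prefix requires k peering sessions, each carrying one path. The function below is illustrative, not the controller's API:

```python
def assign_paths_to_sessions(paths, sessions):
    """Without Add-Path, each BGP session can carry only one path for a
    given prefix, so k ECMP next-hops need k peering sessions
    (worst case: len(sessions) == ECMP fan-out).
    paths: list of third-party next-hops for one prefix.
    sessions: list of peering session handles to the same router."""
    if len(paths) > len(sessions):
        raise ValueError("need as many sessions as ECMP next-hops")
    # One path announced per session; Add-Path would collapse this to one
    return {session: path for session, path in zip(sessions, paths)}
```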
Overriding Routing Decisions: API
Simple REST calls manipulate the network state overrides. Supported calls:
- Logically shutdown/un-shutdown a link
- Logically shutdown/un-shutdown a device
- Announce a prefix with the next-hop set via a device
- Read the current state of downed links/devices, etc.
PUT http://.../overload/link/add=r1,r2&remove=r3,r4
PUT http://.../overload/router/add=r1&remove=r4
State (overloaded links/devices, static prefixes) is persistent across controller reboots and shared across multiple controllers.
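A hedged usage example of the overload API with Python requests; the URL shape is taken from the slide, while the hostname is a placeholder and authentication details are omitted:

```python
import requests

BASE = 'http://controller.example/overload'   # illustrative hostname

# Logically shut down links r1-r2, and restore links r3-r4
requests.put(BASE + '/link/add=r1,r2&remove=r3,r4')

# Logically shut down router r1, and restore router r4
requests.put(BASE + '/router/add=r1&remove=r4')
```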
Ordered FIB Programming
- If BGP RIBs on devices are updated in no particular order, RIB/FIB tables can go out of sync for some time: the well-known micro-loops problem
- For every link-state change, the controller sequences the updates: build a reverse SPF for the link event, then update prefixes from the leaves toward the root
- The updates implode toward the change
- Packet loss measured to be 40-50x less compared to a plain link shutdown
[Figure: for a link overloaded on the path to prefix X, (1) devices farther from the change (R2, S2, R3) are updated first, (2) devices closer to it (R1, S1) are updated second]
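A simplified hop-count sketch of the update sequencing: run a reverse BFS from the changed link and program the devices farthest from it first, so updates implode toward the change. The real controller orders per-prefix RIB updates from a reverse SPF; this version, with its assumed graph format, is only illustrative:

```python
from collections import deque

def update_order(graph, changed_link):
    """graph: device -> iterable of neighbors.
    changed_link: (a, b) endpoints of the overloaded/failed link.
    Returns devices ordered farthest-first from the change, i.e.
    leaves toward the root, to avoid transient micro-loops."""
    a, b = changed_link
    dist = {a: 0, b: 0}
    queue = deque([a, b])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Devices farthest from the change are programmed first
    return sorted(dist, key=dist.get, reverse=True)
```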
Handling Failure Scenarios
Handling Network Failures
The controller may add convergence overhead, but only for prefixes controlled by BGP SDN.
Controller intervention is needed when the primary path fails and no backup paths are available locally. The convergence delay then includes:
- Detecting the link fault / BGP session shutdown
- Withdrawing the keep-alive prefix
- The controller finding a new next-hop
- The controller pushing new path information
- Measured to be < 1 second
Controller intervention is not needed when a primary fails in the ECMP case: in many cases backup paths are available locally, e.g. if an uplink fails, BGP re-converges locally.
- Requires that BGP recursive resolution be event-driven
- Possible to do local restoration in the FIB agent
[Figures: R1 reaching R2 via a primary path with R3 as backup; and R1 with ECMP primaries via R2/R4, where the remaining ECMP members act as local backups]
Multiple Controllers
- Run N+1 controllers in an all-active fashion; any single controller may command the network
- Inject routes with different MEDs according to priority; higher-MED paths are used as backup
- This way, backup routes are always in the BGP table!
- Need to share the overloaded link/device information: an in-memory database, replicated using the Paxos algorithm
- P.O.C. done via ConCoord objects (OpenReplica)
- ConCoord is also used for inter-process coordination: shared database locking and notifying controllers of state changes
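A tiny sketch of the MED scheme, assuming each controller knows its own priority from configuration and a speaker handle with an illustrative announce API (the MED = 100 + priority formula is from the next slide):

```python
CONTROLLER_PRIORITY = 0   # per-controller config; 0 = most preferred

def announce_override(speaker, prefix, next_hop):
    """All-active N+1 controllers: every controller announces the same
    override route, each with MED = 100 + its own priority. BGP prefers
    the lowest MED, so the higher-MED copies stay in the BGP table as
    ready backups if the preferred controller withdraws or dies."""
    med = 100 + CONTROLLER_PRIORITY
    speaker.announce(prefix, next_hop, med=med)   # speaker API is illustrative
```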
Multiple Controllers (cont.)
Each controller:
- Discovers from configuration files: the static network topology, prefixes bound to edge devices, IP addresses to peer with, and the controller's priority
- Obtains a ConCoord event object for synchronization
- Reads the shared state from the ConCoord object (R/W lock)
- Starts peering sessions with all devices
- Expects X% of all known links to be up before making decisions
- Injects override routes with MED = 100 + priority
- Assumes the reconstructed network state is eventually consistent across the controllers
Feature Roadmap
Multiple Topologies
- A subset of edge prefixes is mapped to a separate logical topology
- API to create topologies and assign prefixes to topologies
- API to overload links/devices per topology; overloaded links/devices are kept separately per topology
- Independent SPF runs per topology
- Physical fault reports are raised to all topologies
- Re-routes a subset of traffic, as opposed to all traffic, e.g. for an automated fault-isolation process
Traffic Engineering
- Failures may cause traffic imbalances: e.g. the link between R2 and R4 goes down, R1 does not know that, and a downstream link congests at 100% while others sit at 50%
- This includes both physical failures and logical link/device overloading
- Fix: compute a new traffic distribution and program weighted ECMP
- Signal using BGP Link Bandwidth; not implemented by most vendors
- Not just Clos/tree topologies: think of a BGP ASN as a logical router
[Figure: the controller installs paths with unequal ECMP weights (25%/75%), and the congestion is alleviated]
Traffic Engineering (cont.)
Requires knowing:
- The traffic matrix (TM)
- The network topology and link capacities
Solves a linear programming problem:
- Computes & pushes ECMP weights for every prefix at every hop
- Optimal for a given TM
- A link-state change causes reprogramming
Trade-off: more state is pushed down to the network, and more prefixes are now controlled.
[Figure: a 33%/66% weighted split between nodes A and B]
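A toy instance of the linear program, assuming scipy is available: one demand split across parallel paths so as to minimize the maximum link utilization. The real problem spans every prefix at every hop; this sketch only shows the shape of the formulation, and it reproduces the 33%/66% split from the figure:

```python
from scipy.optimize import linprog

def ecmp_weights(demand, capacities):
    """Minimize the max utilization u subject to:
    sum(x_i) = demand, and x_i <= u * capacity_i for each path i.
    Variables are the per-path flows x_0..x_{n-1} plus u."""
    n = len(capacities)
    c = [0.0] * n + [1.0]                      # objective: minimize u
    a_eq = [[1.0] * n + [0.0]]                 # flows must sum to the demand
    b_eq = [float(demand)]
    a_ub = [[1.0 if j == i else 0.0 for j in range(n)] + [-cap]
            for i, cap in enumerate(capacities)]
    b_ub = [0.0] * n                           # x_i - u*cap_i <= 0
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq)
    flows = res.x[:n]
    total = sum(flows)
    return [f / total for f in flows]          # ECMP weights to program

# e.g. ecmp_weights(15, [10, 20]) -> [1/3, 2/3]: the 33%/66% split
```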
Asks to the vendors!
Weighted ECMP with BGP Link Bandwidth:
- Most common HW platforms can do it (BRCM)
- Signaling via BGP does not look complicated either
- Note: has implications on hardware resource usage
Consistent hashing:
- Localized impact upon ECMP group change
- Goes naturally with weighted ECMP
- Well defined in RFC 2992 (ECMP case)
Add-Path Receive-Only:
- Not a standard (sigh)
- We'd really like to have receive-only functionality
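To make the weighted-ECMP ask concrete, here is an illustrative sketch of the kind of next-hop table common switch ASICs implement: each next-hop gets table slots proportional to its weight and a flow hash indexes the table. Note this simple layout deliberately omits RFC 2992's minimal-disruption property, which is exactly why the slide asks for consistent hashing:

```python
import zlib

def build_ecmp_table(next_hops, weights, size=64):
    """Allocate table entries proportional to each next-hop's weight."""
    table = []
    total = sum(weights)
    for nh, w in zip(next_hops, weights):
        table += [nh] * round(size * w / total)
    return table or list(next_hops)   # guard against all-zero rounding

def pick_next_hop(table, flow_tuple):
    # Hash the flow 5-tuple and index the table, as fixed-function HW would
    h = zlib.crc32(repr(flow_tuple).encode())
    return table[h % len(table)]

# e.g. build_ecmp_table(['N1', 'N2'], [1, 3]) gives N2 ~75% of the slots
```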
Conclusions
Lessons Learned
- Clearly define use cases; don't look for silver bullets
- Put operational implications in front of everything else
- BGP does not require new firmware or APIs; some BGP extensions are nice to have
- BGP code is pretty mature (for most vendors)
- Easy to fail back to regular routing
- Controller code is very lightweight (<1000 LoC)
- Solves our current problems, and in future allows for much more
References
- http://datatracker.ietf.org/doc/draft-lapukhov-bgp-routing-large-dc/
- http://www.cs.princeton.edu/~jrex/papers/rcp-nsdi.pdf
- http://code.google.com/p/exabgp/
- http://datatracker.ietf.org/doc/draft-ietf-idr-link-bandwidth/
- http://openreplica.org/
- http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
- http://www.retitlc.polito.it/mellia/corsi/05-06/ottimizzazione/744.pdf
- http://www.seas.upenn.edu/~guerin/publications/split_hops-dec19.pdf
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.