Network Virtualization and Data Center Networks 263-3825-00 DC Virtualization Basics Part 3 Qin Yin Fall Semester 2013 1
Outline
A Brief History of Distributed Data Centers
The Case for Layer 2 Extension
Layer 2 Extension
- Over Optical Connections: Virtual PortChannels, FabricPath
- Over MPLS: Ethernet over MPLS (EoMPLS), Virtual Private LAN Service (VPLS)
- Over IP: MPLS over GRE, Overlay Transport Virtualization (OTV) 2
Virtual PortChannel Summary
Virtualization Characteristics
- Virtualization: Virtual PortChannel
- Emulation: Single Ethernet switch
- Type: Pooling
- Subtype: Homogeneous
- Scalability: Two physical switches
- Technology area: Networking
- Subarea: Data and control plane virtualization
- Advantages: High throughput, faster fault recovery, less complexity 3
Distributed Data Centers
Data center interconnect (DCI): many physical sites, one logical data center
Business goals
- Seamless workload mobility
- Business continuity
- Pool and maximize global resources
- Distributed applications
Two metrics are defined for each application environment
- Recovery Point Objective (RPO): the maximum tolerable amount of data, measured in time, that can be lost from an IT service
- Recovery Time Objective (RTO): the maximum tolerable amount of time within which an IT service must be restored to operation 4
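As an illustration, the two metrics can be checked against a concrete recovery event. The following sketch uses hypothetical timestamps and objectives (not from the slides), treating RPO as the worst-case data-loss window since the last backup and RTO as the downtime window:

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, failure_time, service_restored,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Check a recovery event against RPO/RTO (illustrative only)."""
    data_lost = failure_time - last_backup      # worst-case window of lost data
    downtime = service_restored - failure_time  # time until service is back
    return {"rpo_met": data_lost <= rpo, "rto_met": downtime <= rto}

t0 = datetime(2013, 9, 1, 0, 0)
result = meets_objectives(
    last_backup=t0,
    failure_time=t0 + timedelta(hours=3),
    service_restored=t0 + timedelta(hours=5),
    rpo=timedelta(hours=4),   # may lose at most 4 hours of data
    rto=timedelta(hours=1),   # must be operational again within 1 hour
)
print(result)  # RPO met (3h <= 4h), RTO missed (2h > 1h)
```

This matches the historical trend in the following slides: tape-based recovery tolerates RPO/RTO of days, while active-active designs target an RPO of 0 and an RTO of seconds.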
From a networking perspective Connected through Layer 3 routing Decreases fate sharing in distributed data center Isolates each site from remote network instabilities Extending Layer 2 domains More details later 5
The Cold Age (Mid-1970s to 1980s) Computer rooms housing mainframe systems Applications based on batch processing RPO and RTO could span days or even weeks Recovery technologies: data backup and retrieval Data: stored on tapes Connectivity: physical transport of tapes to the backup site Cold-standby and warm-standby Data retrieved and delivered to the application Manual intervention to achieve recovery objectives 6
The Hot Age (1990s to Mid-2000s) Internet booms and the advent of electronic business The need for real-time response Recovery technologies focused on service availability RPO and RTO confined to hours, or even minutes Geographic clusters (geoclusters) Application servers installed on at least two geo-separated sites Active node failure triggers automatic switchover to the standby node Generally require data replication to the hot-standby site Synchronous replication (sites tens of kilometers apart at most, to avoid latency issues) Asynchronous replication (data periodically copied from the primary to the secondary site) 7
Geocluster Different types of geocluster communication Heartbeat communication Application state information (such as cached data for database servers) Client traffic (especially when nodes share the same virtual IP address) 8
The Active-Active Age (Mid-2000s -) In a hot-standby site, hardware and software resources are used only in case of a major failure at the main site, so they are activated for only a small amount of time per year Some critical applications (RPO of 0, RTO of seconds) require an active-active design to avoid this resource waste No luxury of unused sites: deploy several active nodes dispersed over multiple data centers Server and storage virtualization provide automatic and quick workload mobility between sites Challenges of scalability and flexibility One of the most discussed topics: Layer 2 extension between remote sites 9
Requirements of Layer 2 Extension Heartbeat and connection state communications are usually directed to multiple destinations Broadcast, multicast, or unknown unicast Ethernet frames (flooding) Active and standby nodes usually share the same virtual IP and MAC address To facilitate traffic handling in the case of failure Server migration Applications do not support IP readdressing Generates painstakingly complex operations Data center expansion A data center has reached a physical limitation A company hires a colocation service from an outsourcing data center As a result, standard Layer 2 connections are deployed to provide extended VLANs over multiple data centers. 10
Challenges of Layer 2 Extension Flooding and broadcast Loops over the Layer 2 extensions can be easily formed A spanning tree instance spanning multiple sites presents formidable challenges Scalability: the recommended STP diameter is 7 Isolation: reconvergence will affect all VLANs within one STP instance Multihoming: multiple DCI links will not be used for data communication 11
Challenges of Layer 2 Extension (cont.) Tromboning can be formed between data centers Non-optimal internal routing within extended VLANs Cause for DCI resource waste: uncontrolled state of an active-standby pair of devices Data confidentiality Mandates strict forms of encryption in data center interconnect to minimize the risk of data leakage Tromboning in action 12
Traditional Layer 2 VPNs Dark Fiber VPLS EoMPLS 13
Ethernet Extensions over Optical Connections Optical connections Distance: less than a few hundred kilometers Dark fiber Fiber-optic pair to connect networking devices Wavelength-division multiplexing (WDM) to increase transport capacity Coarse WDM: multiplex eight optical carrier signals Dense WDM: aggregate a higher number (128, for example) Dark fiber and WDM Communication solutions belonging to Layer 1 (physical) Can transport any data-link protocol including Ethernet 14
Spanning Tree Protocol STP does not allow Ethernet traffic on all the links between DCI switches The STP instance is spread over both sites, sharing any internal topology change or reconvergence 15
STP and Link Utilization STP wastes inter-switch connection resources 16
Link Aggregation STP detects only one logical interface Traffic destined to this interface is load balanced among the active physical links that are part of the channel This virtual interface is called a PortChannel 17
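PortChannel load balancing is typically flow-based: a hash over frame fields (for example the source and destination MAC pair) picks one member link per flow, so a flow's frames stay in order while different flows spread across the channel. The following sketch is a simplified illustration of the idea, not any vendor's actual hash algorithm:

```python
import hashlib

def select_member(src_mac: str, dst_mac: str, members: list[str]) -> str:
    """Pick a physical member link for a flow by hashing the MAC pair.
    The same flow always maps to the same link (no frame reordering),
    while distinct flows are distributed across all members."""
    digest = hashlib.sha256(f"{src_mac}-{dst_mac}".encode()).digest()
    return members[digest[0] % len(members)]

links = ["eth1", "eth2", "eth3", "eth4"]
flow_a = select_member("00:1a:2b:3c:4d:5e", "00:aa:bb:cc:dd:ee", links)
# The same flow hashes to the same member link every time
assert flow_a == select_member("00:1a:2b:3c:4d:5e", "00:aa:bb:cc:dd:ee", links)
```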
Virtual PortChannel Eliminates STP blocked ports Uses all available uplink bandwidth Allows a single device to use a port channel across two upstream switches Dual-homed servers operate in active-active mode Provides fast convergence upon link/device failure 18
Virtual PortChannels on a Layer 2 Extension Virtual PortChannels transform multiple Ethernet links into a single-switch STP connection Benefits Multihoming is enabled: all links are used (and load balanced) The spanning tree topology is simplified: only one connection between sites If the vPC peer-switch feature is deployed, a device failure will not result in reconvergence 19
Virtual PortChannels in Multipoint Data Center Connections Problem vPCs can form a logical looped topology Solution: hub-and-spoke Deploy disjoint STP instances per site STP isolation is enabled on all DCI switches Avoids loops in the Layer 2 extension 20
Traditional Layer 2 VPNs Dark Fiber VPLS EoMPLS 21
MPLS Labels and Packets Provides packet forwarding based on labels Layer 2.5 technology Header fields Label value Experimental (Exp): used to define QoS classes in MPLS networks Bottom of Stack (S) Time to Live (TTL) 22
MPLS Basics Protocol flexibility comes from the capability of stacking labels MPLS services Traffic engineering Configures and defines unidirectional tunnels using the tunnel label Overrides routing protocol decisions Layer 3 virtual private networks Connect different VPNs Inner label: VPN Any Transport over MPLS (AToM) Transport of Layer 2 frames Inner label: virtual circuit Example: EoMPLS 23
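The label stacking that gives MPLS its flexibility is easy to sketch: each stack entry is a 4-byte word holding a 20-bit label, 3 EXP bits, the Bottom-of-Stack (S) bit, and an 8-bit TTL. A minimal encoder (illustrative values; the label numbers are arbitrary):

```python
import struct

def encode_label(label: int, exp: int, bottom: bool, ttl: int) -> bytes:
    """Encode one 4-byte MPLS label stack entry:
    20-bit label | 3-bit EXP | 1-bit Bottom-of-Stack | 8-bit TTL."""
    word = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)

# A two-label stack: outer tunnel label, inner VPN/VC label.
# Only the innermost entry has the S bit set.
stack = encode_label(24, exp=0, bottom=False, ttl=64) + \
        encode_label(72, exp=0, bottom=True, ttl=2)
assert len(stack) == 8
```

Services such as L3VPN and AToM differ only in what the inner label identifies (a VPN versus a virtual circuit); the outer tunnel label is interpreted hop by hop in the same way.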
MPLS Network Forwarding Equivalence Class (FEC) MPLS packets sharing the same label Two types of routers Label Edge Router (LER) Label Switch Router (LSR) MPLS router elements A loopback interface To improve reachability A routing protocol To advertise connected subnets LDP (Label Distribution Protocol) To enable device discovery and label distribution MPLS interfaces between routers 24
EoMPLS Configuration In essence Within the MPLS network Encapsulates Ethernet frames within MPLS packets At the egress of the MPLS network Frames are transported, decapsulated, and delivered unchanged MPLS label stack Tunnel label: routing from the ingress to the egress LER VC label: identifies the virtual circuit within the tunnel Pseudowire Emulation of an Ethernet cable 25
Pseudo Wire Reference Model A Pseudo Wire (PW) is a connection between two provider edge (PE) devices connecting two attachment circuits (ACs) Label Switched Path (LSP): the MPLS tunnel [Figure: customer sites attach to PE1 and PE2 via attachment circuits; Pseudo Wires PW1 and PW2 carry PW PDUs across the Packet Switched Network (PSN), IP or MPLS, inside a PSN tunnel (an LSP in MPLS), providing the emulated service] 26
VC Distribution Mechanism using LDP Unidirectional tunnel LSP To transport PW PDUs from PE to PE based on the tunnel label(s) Both LSPs combined form a single bidirectional Pseudo Wire Directed LDP session To exchange VC information, such as the VC label and control information [Figure: a directed LDP session between PE1 and PE2; LSPs created using IGP+LDP or RSVP-TE; the tunnel label(s) get the packet to the PE router, while the VC label identifies the interface] 27
Ethernet PW Tunnel Encapsulation
Packet layout (one 32-bit word per row):
- Tunnel encapsulation: Tunnel Label (LDP, RSVP, BGP) | EXP | S=0 | TTL
- PW demultiplexer: VC Label (VC) | EXP | S=1 | TTL (set to 2)
- Control word: flags (0000) | reserved | sequence number
- Layer 2 PDU
Tunnel encapsulation: one or more MPLS labels associated with the tunnel; defines the LSP from the ingress to the egress PE router 28
Ethernet PW Demultiplexer
The VC label is obtained from the directed LDP session and identifies individual circuits within a tunnel
Used by the receiving PE to determine
- The egress interface for L2 PDU forwarding (port based)
- The egress VLAN used on the facing interface (VLAN based)
EXP can be set to the values received in the L2 frame 29
PW Operation and Encapsulation (example: PE2 -> PE1 traffic)
- A directed LDP session between PE1 and PE2 distributes VC label 72 for PW1
- Hop-by-hop LDP sessions distribute the tunnel labels for PE1's loopback (Lo0): PE1 advertises pop, P1 advertises label 38, P2 advertises label 24
- PE2 encapsulates the L2 PDU with tunnel label 24 and VC label 72 and sends it over the LSP; the tunnel label is swapped at each LSR and popped before PE1, which uses VC label 72 to identify PW1
This process happens in both directions 30
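The encapsulation built by the ingress PE can be sketched as bytes. The label values 24 (tunnel) and 72 (VC for PW1) follow the slide's example; the control word layout is simplified and the Ethernet frame is a placeholder:

```python
import struct

def mpls_entry(label: int, exp: int, bottom: bool, ttl: int) -> bytes:
    # One 4-byte stack entry: 20-bit label | 3-bit EXP | 1-bit S | 8-bit TTL
    return struct.pack("!I", (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl)

def eompls_encapsulate(l2_pdu: bytes, tunnel_label: int, vc_label: int,
                       seq: int = 0) -> bytes:
    """Sketch of the EoMPLS stack from the slides: tunnel label, then the
    VC label (bottom of stack, TTL set to 2), then a simplified control
    word (flags/reserved zero + sequence number), then the Ethernet frame."""
    control_word = struct.pack("!I", seq & 0xFFFF)
    return (mpls_entry(tunnel_label, 0, False, 255)
            + mpls_entry(vc_label, 0, True, 2)
            + control_word
            + l2_pdu)

frame = bytes(64)  # placeholder Ethernet frame
packet = eompls_encapsulate(frame, tunnel_label=24, vc_label=72)
assert len(packet) == 4 + 4 + 4 + 64
```

Along the LSP only the outer tunnel label changes (24 swapped to 38 at P2, popped at P1); the VC label and everything below it are untouched until the egress PE.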
Virtual Private LAN Service An end-to-end architecture allowing MPLS networks to offer Layer 2 multipoint Ethernet services Provides emulation of a single virtual Ethernet bridged network Virtual bridges linked with MPLS Pseudo Wires VPLS is an architecture The data plane used is the same as EoMPLS (point-to-point) 31
Virtual Private LAN Service It is Virtual Multiple instances share the same physical infrastructure It is Private Each instance is independent and isolated from one another It is LAN Service It emulates Layer 2 multipoint connectivity between subscribers 32
VPLS Components
- Attachment circuits: port or VLAN mode; attachment devices can be switches or routers
- Full mesh of LSPs between N-PEs: Pseudo Wires ride within the LSPs
- Virtual Switch Interface (VSI): terminates the PWs and provides the Ethernet bridge function
- LDP between PEs: used to exchange VC and tunnel labels for the Pseudo Wires 33
Virtual Switch Interface
Flooding / Forwarding
- MAC table instances per customer (port/VLAN) for each PE
- Associates ports with MACs, floods unknowns to all other ports
Address Learning / Aging
- LDP enhanced with an additional MAC List TLV (label withdrawal)
- MAC timers refreshed by incoming frames
Loop Prevention
- A full mesh of Pseudo Wires (VCs in EoMPLS); unidirectional LSPs carry the VCs between each pair of N-PEs
- Split-horizon forwarding prevents loops 34
VPLS Flooding and Forwarding
- Flooding (broadcast, multicast, unknown unicast): a frame with an unknown destination address is flooded over the physical ports and the Pseudo Wires in the LSPs
- Dynamic learning of source MAC addresses on physical ports and VCs
- Forwarding over physical ports and virtual circuits 35
MAC Learning and Forwarding
- Example: over the directed LDP session, PE1 signals "send me frames using label 102" and PE2 signals "send me frames using label 170"; PE1 then associates MAC1 with its local port E0/0 and MAC2 with label 170 (toward PE2), while PE2 associates MAC2 with E0/1 and MAC1 with label 102
- Broadcast, multicast, and unknown unicast addresses are learned via the received label associations
- Two LSPs are associated with a VC (Tx and Rx); if either the inbound or the outbound LSP is down, the entire Pseudo Wire is considered down 36
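A toy model can tie together the VSI behavior described on these slides: source-MAC learning, flooding of unknown destinations, and split horizon among pseudowires. The port and PW names are illustrative; a real VSI keys its tables per customer instance:

```python
class VSI:
    """Toy Virtual Switch Interface: learns source MACs, floods unknown
    destinations, and never forwards a frame received on one pseudowire
    out another pseudowire (split horizon over the full PW mesh)."""
    def __init__(self, access_ports, pseudowires):
        self.access_ports = set(access_ports)
        self.pseudowires = set(pseudowires)
        self.mac_table = {}  # MAC -> access port or pseudowire

    def forward(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port        # learn the source
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]     # known unicast
        # Unknown unicast: flood everywhere except the ingress...
        targets = (self.access_ports | self.pseudowires) - {in_port}
        if in_port in self.pseudowires:
            targets -= self.pseudowires          # ...but never PW -> PW
        return sorted(targets)

vsi = VSI(access_ports=["E0/0"], pseudowires=["PW1", "PW2"])
# A frame from PW1 with an unknown destination floods to access ports only
assert vsi.forward("PW1", "MAC2", "MAC9") == ["E0/0"]
# The reply toward MAC2 is now known unicast via PW1
assert vsi.forward("E0/0", "MAC1", "MAC2") == ["PW1"]
```

Split horizon is what makes the full PW mesh loop-free without running STP across the core: since every pair of N-PEs is directly connected, a frame never needs to transit a second pseudowire.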
MAC Address Withdraw
- An LDP MAC Address Withdraw message speeds up the convergence process; otherwise the PE relies on the MAC address aging timer
- Upon failure, the PE removes locally learned MAC addresses and sends an LDP Address Withdraw to the remote PEs in the VPLS (using the directed LDP session)
- The new MAC List TLV is used to withdraw the addresses 37
VPLS Functional Components Customer MxUs connect through SP PoPs: U-PE, N-PE, MPLS core, N-PE, U-PE The N-PE provides VPLS termination and Layer 3 services The U-PE provides the customer UNI; the MxU is the customer device 38
Direct Attachment (Flat) Characteristics
- Suitable for simple/small implementations
- Full mesh of directed LDP sessions required; N*(N-1)/2 Pseudo Wires required
- Scalability issues as the number of PE routers grows; no hierarchical scalability
- VLAN- and port-level support
- Potential signaling and packet replication overhead: large amounts of multicast replication over the same physical links, and CPU overhead for the replication 39
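The quadratic growth behind the N*(N-1)/2 figure is worth making concrete; it is the core scalability argument for H-VPLS:

```python
def full_mesh_pw(n_sites: int) -> int:
    """Number of Pseudo Wires needed for a full mesh of n sites."""
    return n_sites * (n_sites - 1) // 2

# Growth is quadratic: adding one site to a 10-site mesh adds 10 new PWs,
# and each new PW also means a new directed LDP session to provision.
print([full_mesh_pw(n) for n in (2, 5, 10, 50)])  # [1, 10, 45, 1225]
```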
Direct Attachment VPLS (Flat Architecture) [Figure: customer frames arrive on 802.1q Ethernet attachments (VLAN/port) at the N-PEs; a full mesh of PWs + LDP spans the SP core; each frame (Data, MAC1, MAC2) is carried across the core with a VC label inside a Pseudo Wire] 40
Hierarchical VPLS (H-VPLS) Best for larger scale deployment Reduction in packet replication and signaling overhead Consists of two levels in a Hub and Spoke topology Hub consists of full mesh VPLS Pseudo Wires in MPLS core Spokes consist of L2/L3 tunnels connecting to VPLS (Hub) PEs Q-in-Q (L2), MPLS (L3), L2TPv3 (L3) 41
Why H-VPLS?
VPLS (flat, PE-only):
- Potential signaling overhead
- Full PW mesh from the edge
- Packet replication done at the edge
- Node discovery and provisioning extends end to end
H-VPLS (MTU-s, PE-r, and PE-rs devices):
- Minimizes signaling overhead
- Full PW mesh among core devices only
- Packet replication done in the core
- Partitions the node discovery process 42
MPLS Edge H-VPLS [Figure: U-PEs connect over 802.1q access (1) to PE-rs devices in the MPLS access network, which carry the customer frame (Data, MAC1, MAC2, customer VLAN) in an MPLS Pseudo Wire (2); in the MPLS core, a full mesh of PWs + LDP interconnects the PE-rs devices (3); the same VC ID is used at the edge and in the core, although the labels may differ] 43
Layer 2 VPNs Dark Fiber VPLS EoMPLS 44
Flooding Behavior Traditional Layer 2 VPN technologies rely on flooding to propagate MAC reachability The flooding behavior causes failures to propagate to every site in the Layer 2 VPN Goal Provide Layer 2 connectivity, yet restrict the reach of the unknown unicast flooding domain in order to contain failures and preserve resiliency 45
Pseudo Wire Maintenance Before any learning can happen, a full mesh of pseudowires/tunnels must be in place For N sites, there will be N*(N-1)/2 pseudowires Complex to add and remove sites Head-end replication for multicast and broadcast causes sub-optimal bandwidth utilization Goal Provide point-to-cloud provisioning and optimal bandwidth utilization in order to reduce cost 46
Multi-homing Requires additional protocols (BGP, ICC, EEM) STP is often extended across sites, so malfunctions impact all sites Goal Natively provide automatic detection of multihoming without the need to extend the STP domains, together with more efficient load balancing 47
OTV Changes the Game
Circuits + data plane flooding:
- Full mesh of circuits
- MAC learning based on flooding
- Tunnels and Pseudo Wires
- Operationally challenging loop prevention and multi-homing
Packets + control protocol learning:
- Packet switched connectivity
- MAC learning by a control protocol
- Dynamic encapsulation
- Operational simplification
- Automatic loop prevention and multi-homing 48
Overlay Transport Virtualization OTV delivers a virtual L2 transport over any L3 Infrastructure Overlay Independent of the Infrastructure technology and services, flexible over various inter-connect facilities Transport Transport services for Layer 2 and Layer 3 Ethernet and IP traffic Virtualization Provides virtual stateless multi-access connections. Can be further partitioned into VPNs, VRFs, VLANs 49
OTV Control Plane MAC Learning a. A server with MAC address X sends frames that are flooded or broadcast within the site b. OTV1 learns MAC X and populates its MAC address table. c. OTV1 advertises MAC X with an IS-IS update. d. OTV2 and OTV3 become aware that MAC X can be reached through OTV1 and populate their MAC address tables using the virtual Layer 2 interface called Overlay 50
OTV Frame Forwarding a. Server2 sends a unicast frame destined to MAC X that is flooded to OTV2. b. OTV2 checks its MAC address table and realizes that the MAC X entry points to an Overlay interface. c. Internally in OTV2, this Overlay interface provides a mapping to OTV1's IP address. As a result, the unicast frame is encapsulated into an IP packet directed to OTV1. d. OTV1 receives the IP packet and decapsulates it, recovering the original Ethernet frame. e. OTV1 uses its local MAC address table to forward the frame to Server1. 51
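The learning and forwarding steps above can be put together in a toy model. This is a sketch of the control-plane idea only: remote MACs are installed by protocol advertisements (IS-IS in OTV), not by data-plane flooding, and they point at the Overlay interface, which maps to the advertising device's IP address. Device names and addresses are hypothetical:

```python
class OTVEdge:
    """Toy OTV edge device with a MAC table fed by a control protocol."""
    def __init__(self, name: str, ip: str):
        self.name, self.ip = name, ip
        self.mac_table = {}  # MAC -> ("local", port) or ("overlay", remote_ip)

    def advertise(self, mac: str) -> dict:
        # Stand-in for an IS-IS update advertising a locally learned MAC
        return {"origin": self.name, "mac": mac, "ip": self.ip}

    def receive_update(self, update: dict) -> None:
        # Remote MACs point at the Overlay interface -> remote device IP
        self.mac_table[update["mac"]] = ("overlay", update["ip"])

    def forward(self, dst_mac: str) -> str:
        entry = self.mac_table.get(dst_mac)
        if entry and entry[0] == "overlay":
            return f"encapsulate in IP packet to {entry[1]}"
        return "local delivery or drop"

otv1 = OTVEdge("OTV1", "10.0.0.1")
otv2 = OTVEdge("OTV2", "10.0.0.2")
otv1.mac_table["MAC_X"] = ("local", "E1/1")   # learned by flooding inside the site
otv2.receive_update(otv1.advertise("MAC_X"))  # control-plane update, no flooding
assert otv2.forward("MAC_X") == "encapsulate in IP packet to 10.0.0.1"
```

Note what is absent: OTV2 never floods toward the other sites to discover MAC X, which is exactly the contrast with the VPLS flooding behavior described earlier.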
OTV Encapsulation Outer IP header, then an OTV shim header (carrying the overlay number and VLAN), then the original Layer 2 header and frame 52
OTV elements Edge device: network equipment that is actually deploying OTV - Internal interface: connected to a Layer 2 network - To process Ethernet frames - Join interface: connected to the Layer 3 network - To send or receive OTV packets 53
OTV elements Overlay interface: - A virtual Layer 2 interface that represents an OTV Layer 2 extension to other edge devices - Used in the MAC address tables as the interface associated with remote MAC addresses - Always associated with a join interface 54
OTV elements Site VLAN: - A dedicated VLAN used for discovery and adjacency maintenance between edge devices on the same site - Should not be extended to other sites. 55
Spanning Tree and OTV OTV is site transparent: no changes to the STP topology Each site keeps its own STP domain An Edge Device will send and receive BPDUs ONLY on the OTV Internal Interfaces 56
OTV Loop Avoidance Blocking unknown unicast traffic between edge devices Authoritative edge device (AED) The only edge device on a site handling multicast and broadcast traffic for the OTV-extended VLAN 57
OTV and Multi-homing OTV built-in multi-homing Allows Layer 2 traffic to be load balanced through different IP WAN links OTV multi-homing options Automatic distribution of VLANs among the available AED candidates (a hashing function to deploy this distribution). For unicast egress traffic, OTV can be load balanced among all the equal-cost Layer 3 paths to remote edge devices. Multidestination egress and ingress traffic can only use the join interface. Layer 3 PortChannels (between AED and a single device or one deploying VSS). 58
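The AED-based VLAN distribution mentioned above can be sketched as a deterministic hash: every edge device computes the same function, so exactly one AED handles multidestination traffic for each extended VLAN without any extra election traffic. The function below is a stand-in; the actual hash is platform-specific, and the device names are hypothetical:

```python
def assign_aed(vlan: int, aed_candidates: list[str]) -> str:
    """Sketch of OTV's VLAN-to-AED distribution: a shared deterministic
    hash maps each extended VLAN to one Authoritative Edge Device."""
    return aed_candidates[vlan % len(aed_candidates)]

aeds = ["OTV-A", "OTV-B"]
mapping = {vlan: assign_aed(vlan, aeds) for vlan in (100, 101, 102, 103)}
# VLANs alternate between the two AED candidates, load balancing the
# broadcast/multicast duty across both WAN links
assert mapping == {100: "OTV-A", 101: "OTV-B", 102: "OTV-A", 103: "OTV-B"}
```

Because the mapping is a pure function of the VLAN and the candidate set, a device failure simply shrinks the candidate list and the surviving devices re-derive the same new assignment independently.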
References
- Jeff Apcar. An Introduction to VPLS. http://stor.balios.net/divers/vpls_introduction.ppt
- Peter Lam, Patrick Warichet. Simplifying Data Center Interconnect with Overlay Transport Virtualization (OTV). 59
Ethernet over MPLS Summary
Virtualization Characteristics
- Virtualization: Ethernet over MPLS
- Emulation: Ethernet connection
- Type: Abstraction
- Subtype: Structural
- Scalability: Hardware and software dependent
- Technology area: Networking
- Subarea: Data plane virtualization
- Advantages: Layer 2 extension, simplicity, transparency 60
Virtual Private LAN Summary
Virtualization Characteristics
- Virtualization: Virtual Private LAN
- Emulation: Ethernet bridge
- Type: Abstraction
- Subtype: Structural
- Scalability: Hardware and software dependent
- Technology area: Networking
- Subarea: Data plane virtualization
- Advantages: Layer 2 extension, multipoint connections, loop avoidance within the MPLS network 61
Overlay Transport Virtualization Summary
Virtualization Characteristics
- Virtualization: Overlay Transport Virtualization
- Emulation: Overlay Ethernet network
- Type: Abstraction
- Subtype: Structural
- Scalability: Hardware and software dependent
- Technology area: Networking
- Subarea: Data and control plane virtualization
- Advantages: Layer 2 extension, multipoint connections, transport independence, loop avoidance 62