Chapter 6. Paper Study: Data Center Networking


Data Center Networks
Major theme: what new networking issues are posed by large-scale data centers?
- Network architecture? Topology design? Addressing? Routing? Forwarding?
Please do the required readings!

Data Center Interconnection Structure
- Nodes in the system: racks of servers
- How are the nodes (racks) interconnected? Typically via a hierarchical interconnection structure
- Today's typical data center structure: the Cisco-recommended data center structure
- Starting from the bottom level: rack (top-of-rack) switches, 1-2 layers of (layer-2) aggregation switches, access routers, core routers
- Is such an architecture good enough?

Cisco Recommended Data Center Structure
(Figure: a layer-3 core of Internet-facing core routers and access routers above a layer-2 aggregation tier of switches and load balancers, with racks of servers at the bottom.)
Key: CR = L3 core router, AR = L3 access router, S = L2 switch, LB = load balancer, A = rack of 20 servers with a top-of-rack switch

Data Center Design Requirements
- Data centers typically run two types of applications: outward facing (e.g., serving web pages to users) and internal computations (e.g., MapReduce for web indexing)
- Workloads are often unpredictable: multiple services run concurrently within a DC, and demand for new services is unexpected
- Failures of servers are the norm. Recall that GFS, MapReduce, etc., resort to dynamic reassignment of chunkservers and jobs/tasks (worker servers) to deal with failures; data is often replicated across racks
- The traffic matrix between servers is constantly changing

Data Center Costs

Amortized Cost*   Component              Sub-Components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

*3-year amortization for servers, 15-year for infrastructure; 5% cost of money (see the sketch below).
- Total cost varies; power consumption in the data center is a major component
- Server costs dominate; network costs are significant
- Long provisioning timescales: new servers are purchased quarterly at best
Source: The Cost of a Cloud: Research Problems in Data Center Networks, SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
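As a rough illustration of how these amortized shares arise, the sketch below (with hypothetical capital costs, not figures from the source) computes a monthly amortized cost for servers and power infrastructure using the 3-year/15-year lifetimes and 5% cost of money noted above.

```python
# Minimal sketch: amortized monthly cost with a cost-of-money (interest) rate.
# The capital costs below are hypothetical placeholders, not from the paper.

def monthly_amortized(capital_cost, lifetime_years, annual_rate=0.05):
    """Standard annuity formula: spread the capital cost over the lifetime,
    charging `annual_rate` cost of money on the outstanding balance."""
    n = lifetime_years * 12          # number of monthly payments
    r = annual_rate / 12             # monthly interest rate
    return capital_cost * r / (1 - (1 + r) ** -n)

servers_capex = 50_000_000       # hypothetical: total server purchase cost
power_infra_capex = 25_000_000   # hypothetical: UPS, cooling, distribution

print("servers /month:     %.0f" % monthly_amortized(servers_capex, 3))
print("power infra /month: %.0f" % monthly_amortized(power_infra_capex, 15))
```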

Overall Data Center Design Goal: Agility ("any service, any server")
- Turn the servers into a single large fungible pool
- Let services "breathe": dynamically expand and contract their footprint as needed
- At the application layer this is done by Google's GFS (Google File System), BigTable, and MapReduce
Benefits:
- Increased service developer productivity
- Lower cost
- High performance and reliability

Achieving Agility
These are the three motivators for most data center infrastructure projects!
1. Workload management: means for rapidly installing a service's code on a server; dynamic cluster scheduling and server assignment (e.g., MapReduce, Bigtable, virtual machines, disk images)
2. Storage management: means for a server to access persistent data; distributed file systems (e.g., GFS)
3. Network management: means for communicating with other servers, regardless of where they are in the data center; achieve high performance and reliability

Networking Objectives
1. Uniform high capacity: capacity between servers limited only by their NICs; no need to consider topology when adding servers; high capacity between any two servers, no matter which racks they are located in!
2. Performance isolation: traffic of one service should be unaffected by others
3. Ease of management: plug-&-play (layer-2 semantics); flat addressing, so any server can have any IP address; server configuration is the same as in a LAN; legacy applications depending on broadcast must work

Is Today's DC Architecture Adequate?
- Hierarchical network with 1+1 redundancy
- Components higher in the DC hierarchy must have the capacity to handle more traffic: more expensive, more effort spent on availability (a scale-up design)
- Servers connect via 1 Gbps links to top-of-rack switches; other links are a mix of 1G and 10G, fiber and copper
Open questions:
- Uniform high capacity?
- Performance isolation? (typically via VLANs)
- Agility in terms of dynamically adding or shrinking servers?
- Agility in terms of adapting to failures and to traffic dynamics?
- Ease of management?
(Figure: the hierarchical topology from the earlier slide; key: CR = L3 core router, AR = L3 access router, S = L2 switch, LB = load balancer, A = top-of-rack switch.)

Paper Study
1. A Scalable, Commodity Data Center Network Architecture: a new fat-tree interconnection structure (topology) to increase bisection bandwidth; needs new addressing and forwarding/routing
2. VL2: A Scalable and Flexible Data Center Network: consolidates layer 2/layer 3 into a virtual layer 2; separates naming and addressing, and also deals with dynamic load-balancing issues
Optional materials:
- PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
- BCube: A High-Performance, Server-centric Network Architecture for Modular Data Centers

A Scalable, Commodity Data Center Network Architecture
Main goal: address the limitations of today's data center network architecture
- Avoid single points of failure (of a link or device)
- Avoid oversubscription of links in the topology, a trade-off between cost and provisioned capacity
Key design considerations/goals:
- Allow host communication at line speed, no matter where the hosts are located!
- Backwards compatible with existing infrastructure: no changes in applications, and support of layer 2 (Ethernet)
- Cost effective: cheap infrastructure, low power consumption and heat dissipation
M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 63-74, 2008.

Fat-Tree Based DC Architecture
- Interconnect racks (of servers) using a fat-tree topology
- Fat-tree: a special type of Clos network
- k-ary fat tree: a three-layer topology (edge, aggregation, and core)
  - Each pod consists of (k/2)^2 servers and 2 layers of k/2 k-port switches
  - Each edge switch connects to k/2 servers and k/2 aggregation switches
  - Each aggregation switch connects to k/2 edge and k/2 core switches
  - (k/2)^2 core switches: each connects to k pods
(Figure: fat-tree with k = 4)
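To make the k-ary fat-tree parameters concrete, here is a minimal sketch (my own illustration, following the counts listed above) that computes the component totals for a given switch port count k:

```python
def fat_tree_sizes(k):
    """Component counts for a k-ary fat tree built from k-port switches.
    Follows the construction above: k pods, each with k/2 edge and k/2
    aggregation switches, plus (k/2)^2 core switches."""
    assert k % 2 == 0, "k must be even"
    edge = aggregation = k * (k // 2)    # k pods, k/2 switches per layer
    core = (k // 2) ** 2
    servers = k * (k // 2) ** 2          # (k/2)^2 servers per pod = k^3/4
    return {"edge": edge, "aggregation": aggregation,
            "core": core, "servers": servers}

print(fat_tree_sizes(4))    # {'edge': 8, 'aggregation': 8, 'core': 4, 'servers': 16}
print(fat_tree_sizes(48))   # 27,648 servers from 48-port switches
```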

Why a Fat-Tree Based Topology?
- A fat tree has identical bandwidth at any bisection; each layer has the same aggregate bandwidth
- It can be built using cheap devices with uniform capacity: each port supports the same speed as an end host, and all devices can transmit at line speed if packets are distributed uniformly along the available paths
- Great scalability: k-port switches support k^3/4 servers
(Figure: fat-tree network with K = 3 supporting 54 hosts)

Cost of Maintaining Switches
(Figure: cost comparison of 1G and 10G switches.)

The Fat-Tree Topology is Great, But...
- Is using a fat-tree topology to interconnect racks of servers by itself sufficient? What routing protocols should we run on these switches?
- Layer-2 switching: data-plane flooding!
- Layer-3 IP routing: shortest-path IP routing will typically use only one path despite the path diversity in the topology; if equal-cost multipath routing is applied at each switch independently and blindly, packet reordering may occur, and load may still not be well balanced
- Aside: control-plane flooding!

Fat-Tree, Modified
- Enforce a special (IP) addressing scheme in the DC: unused.PodNumber.SwitchNumber.Endhost
- Allows hosts attached to the same edge switch to route only through that switch
- Allows intra-pod traffic to stay within the pod
- Uses two-level lookups to distribute traffic and maintain packet ordering: the first level is a prefix lookup, the second level is a suffix lookup (see the sketch below)
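A minimal sketch of the two-level lookup idea (the table entries below are hypothetical; the real scheme installs such tables in the switches as described in the fat-tree paper): a packet is first matched against prefixes covering intra-pod destinations, and on a miss the last octet (host suffix) selects the uplink, spreading traffic while keeping any one host-to-host flow on a single path.

```python
import ipaddress

# First-level table: terminating prefixes for destinations in this pod
# (hypothetical entries for pod 10.2.x.x on an edge switch).
prefix_table = {
    ipaddress.ip_network("10.2.0.0/24"): "port0",   # subnet of edge switch 0
    ipaddress.ip_network("10.2.1.0/24"): "port1",   # subnet of edge switch 1
}

# Second-level table: host suffixes mapped to uplink ports.
# Matching on the host suffix spreads inter-pod traffic across uplinks
# while keeping packets to one destination on one path (no reordering).
suffix_table = {2: "uplink2", 3: "uplink3"}

def lookup(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    for net, port in prefix_table.items():   # level 1: prefix match
        if dst in net:
            return port
    host_suffix = int(dst) & 0xFF             # level 2: last octet
    return suffix_table[host_suffix]

print(lookup("10.2.1.3"))   # intra-pod destination -> port1
print(lookup("10.4.1.2"))   # inter-pod destination -> uplink2 (chosen by suffix)
```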

More on the Fat-Tree DC Architecture: Diffusion Optimizations
- Flow classification (eliminates local congestion): assign traffic to ports on a per-flow basis instead of a per-host basis (see the sketch below)
- Flow scheduling (eliminates global congestion): prevent long-lived flows from sharing the same links by assigning long-lived flows to different links
- What are potential drawbacks of this architecture?
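A minimal sketch of per-flow assignment (an illustration only, not the paper's exact heuristic): hashing the flow 5-tuple keeps all packets of a flow on one uplink, avoiding reordering, while different flows from the same host can take different uplinks.

```python
import hashlib

UPLINKS = ["up0", "up1"]  # hypothetical uplink ports of an edge switch

def port_for_flow(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Hash the flow 5-tuple so every packet of a flow uses the same uplink,
    while different flows spread across the available uplinks."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Two flows between the same hosts may land on different uplinks:
print(port_for_flow("10.2.1.2", "10.4.1.2", 12345, 80))
print(port_for_flow("10.2.1.2", "10.4.1.2", 12346, 80))
```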

VL2: A Scalable and Flexible Data Center Network
Main goal: support agility and be cost-effective
- A virtual (logical) layer-2 architecture for connecting racks of servers (the network as one big "virtual switch")
- Employs a 3-level Clos topology (full mesh in the top 2 levels) with non-uniform switch capacities
- Provides identity/location separation: application-specific vs. location-specific addresses; employs a directory service for name resolution; requires direct host participation
- Explicitly accounts for DC traffic-matrix dynamics: employs the Valiant load-balancing (VLB) technique, using randomization to cope with volatility
A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, "VL2: a scalable and flexible data center network," in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, pages 51-62, 2009.

Clos Networks
- A Clos network is a kind of multistage circuit-switching network, proposed by Charles Clos in 1953.
- It represents a theoretical idealization of practical multistage telephone switching systems.
- Clos networks are needed when the physical circuit-switching needs exceed the capacity of the largest feasible single crossbar switch.
- The key advantage of Clos networks is that the number of crosspoints (which make up each crossbar switch) required can be far fewer than if the entire switching system were implemented with one large crossbar switch.

Clos Networks (cont.)
- Clos networks have three stages: the ingress stage, the middle stage, and the egress stage. Each stage is made up of a number of crossbar switches, often just called crossbars.
- Each call entering an ingress crossbar switch can be routed through any of the available middle-stage crossbar switches to the relevant egress crossbar switch.
- A middle-stage crossbar is available for a particular new call if both the link connecting the ingress switch to the middle-stage switch and the link connecting the middle-stage switch to the egress switch are free.

Clos Networks (cont.)
- Clos networks are defined by three integers n, m, and r.
- n is the number of sources that feed into each of the r ingress-stage crossbar switches.
- Each ingress-stage crossbar switch has m outlets, and there are m middle-stage crossbar switches. The crosspoint savings over a single crossbar are illustrated below.
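A small sketch (my own illustration, not from the slides) comparing crosspoint counts: a three-stage Clos network uses r ingress crossbars of size n x m, m middle crossbars of size r x r, and r egress crossbars of size m x n, versus a single crossbar connecting all n*r inputs to all n*r outputs.

```python
def clos_crosspoints(n, m, r):
    """Total crosspoints in a 3-stage Clos network:
    r ingress (n x m) + m middle (r x r) + r egress (m x n) crossbars."""
    return r * n * m + m * r * r + r * m * n

def single_crossbar_crosspoints(n, r):
    """One big crossbar connecting all n*r inputs to all n*r outputs."""
    return (n * r) ** 2

# Example with n = 16 inputs per ingress switch, r = 16 ingress switches,
# and m = 2n - 1 = 31 middle switches (strictly non-blocking, by Clos's theorem).
n, r = 16, 16
m = 2 * n - 1
print(clos_crosspoints(n, m, r))          # 23,808 crosspoints
print(single_crossbar_crosspoints(n, r))  # 65,536 crosspoints
```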

Bidirectional Paths

Routing in Bidirectional Paths
- These networks are multi-path; routing takes place in two steps: route to an intermediate node, followed by routing to the destination.
1. Multiple intermediate nodes can be selected.
2. The path from the intermediate node to the destination is unique.

Moving to Fat Trees
- Folded Clos = folded Benes = fat-tree network
(Figure: a fat tree with nodes 0-15 at the leaves, switches at the tree vertices, and the network bisection marked.)
- Total link bandwidth is constant across all tree levels, giving full bisection bandwidth
- Equivalent to the folded Benes topology
- The preferred topology in many system-area networks
[T. M. Pinkston, J. Duato, with major contributions by J. Flich]

Fat Trees: Another View
(Figure: the same network drawn with forward and backward links.)
- Equivalent to the preceding multistage implementation
- A common topology in many supercomputer installations

Specific Objectives and Solutions

Scale-out vs. Scale-up: VL2 Topology Design
- Argue for, and exploit, the gap between switch-to-switch and switch-to-server capacities: currently 10 Gbps vs. 1 Gbps, in the future 40 Gbps vs. 10 Gbps
- A scale-out design with broad layers, e.g., a 3-level Clos topology with a full mesh in the top 2 levels: ToR switches, aggregation switches, and core (intermediate) switches
- Less wiring complexity and more path diversity
- Same bisection capacity at each layer: no oversubscription
- Extensive path diversity: graceful degradation under failure

VL2 Topology: Example

Addressing and Routing: Address Resolution and Packet Forwarding
- LA (location-specific IP address): assigned to all switches and interfaces; any packet encapsulated with an LA is forwarded along the shortest path.
- AA (application-specific IP address): associated with an LA, namely the IP address of the ToR switch to which the application server is connected.

Addressing and Routing (cont.)
Packet forwarding steps (a simplified sketch follows):
- The sending server (its VL2 agent) encapsulates the sender's packet, setting the destination of the outer header to the LA associated with the destination AA.
- Once the packet arrives at that LA, the destination switch decapsulates the packet and delivers it to the destination AA.
Address resolution and access control:
- The first time a host sends to an AA, it generates an ARP request for that AA.
- The source's network stack intercepts the ARP request and converts it to a unicast query to the directory server (DS).
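A minimal sketch of the lookup-and-encapsulate step, assuming a toy in-memory directory (the real VL2 directory system is a distributed service; the names and mappings below are hypothetical):

```python
# Toy AA-to-LA directory: application address -> locator (ToR switch) address.
directory = {
    "20.0.0.55": "10.1.1.1",   # hypothetical AA -> LA of its ToR switch
    "20.0.0.56": "10.1.2.1",
}

def encapsulate(packet, dst_aa):
    """Wrap the original packet in an outer header addressed to the LA
    of the destination AA, as an agent on the sending server would."""
    la = directory[dst_aa]                  # directory lookup (cached in practice)
    return {"outer_dst": la, "inner": packet}

def decapsulate(frame):
    """At the destination ToR (LA), strip the outer header and deliver."""
    return frame["inner"]

pkt = {"src": "20.0.0.10", "dst": "20.0.0.55", "payload": b"hello"}
frame = encapsulate(pkt, pkt["dst"])
print(frame["outer_dst"])         # 10.1.1.1 (LA of the destination's ToR)
print(decapsulate(frame) == pkt)  # True
```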

Address Resolution and Packet Forwarding

Addressing and Routing: Name-Location Separation
- Copes with host churn with very little overhead
- Allows the use of low-cost switches; protects the network and hosts from host-state churn; obviates host and switch reconfiguration
- Switches run link-state routing and maintain only the switch-level topology
- Servers use flat names
(Figure: servers x, y, z under ToR switches 1-4; the directory service maps each name to its current ToR, answers lookups, and updates the mapping when a server moves.)

Maintaining Host Information in the Directory System (DS)
The DS provides two key functions:
1. Lookups and updates for AA-to-LA mappings.
2. A reactive cache-update mechanism that ensures eventual consistency of the mappings with very little update overhead.
Two server types with different performance profiles in the DS architecture:
- Read-optimized: replicated lookup servers that cache AA-to-LA mappings and communicate with agents.
- Write-optimized: asynchronous replicated state machine (RSM) servers offering a strongly consistent, reliable store of AA-to-LA mappings.

Directory System
- Replicated state machine (RSM): offers a strongly consistent, reliable store of AA-to-LA mappings

Use Randomization to Cope with Volatility: Valiant Load Balancing (VLB)
- Every flow is bounced off a random intermediate switch
- Provably hotspot-free for any admissible traffic matrix
- Servers could further randomize flowlets if needed
(Figure: a Clos topology built from D-port 10G switches: D/2 intermediate switches, D aggregation switches with D/2 ports up and D/2 ports down, and top-of-rack switches with 20 servers each, supporting (D^2/4) * 20 servers in total.)

Node degree (D) of available switches vs. number of servers supported:
D      # Servers in pool
4      80
24     2,880
48     11,520
144    103,680
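The table follows directly from the sizing in the figure; a one-line check of the arithmetic (my own sketch):

```python
def vl2_servers(d, servers_per_tor=20):
    """Servers supported by a VL2 Clos built from d-port switches:
    d**2 / 4 ToR switches, each hosting `servers_per_tor` servers."""
    return (d * d // 4) * servers_per_tor

for d in (4, 24, 48, 144):
    print(d, vl2_servers(d))   # 80, 2880, 11520, 103680
```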

VL2 Summary
VL2 achieves agility at scale via:
1. L2 semantics
2. Uniform high capacity between servers
3. Performance isolation between services
Lessons:
- Randomization can tame volatility
- Add functionality where you have control
- There's no need to wait!

Additional Case Studies (Optional Material)
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
Main idea: a new hierarchical addressing scheme to facilitate dynamic and fault-tolerant routing/forwarding
R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, "PortLand: a scalable fault-tolerant layer 2 data center network fabric," in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, pages 39-50, 2009.

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
- PortLand is a single logical layer-2 data center network fabric that scales to millions of endpoints
- PortLand internally separates host identity from host location: it uses the IP address as the host identifier and introduces Pseudo MAC (PMAC) addresses internally to encode endpoint location
- PortLand runs on commodity switch hardware with unmodified hosts

PortLand Requirements
- Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses, since doing so would break pre-existing TCP connections and application-level state.
- An administrator should not need to configure any switch before deployment.
- Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.
- There should be no forwarding loops.
- Failures will be common at scale, so failure detection should be rapid and efficient. Existing unicast and multicast sessions should proceed unaffected to the extent allowed by underlying physical connectivity.

Design Goals for the Network Fabric
Support for agility!
- Easy configuration and management: plug-&-play
- Fault tolerance, routing, and addressing: scalability
- Commodity switch hardware: small switch state
- Virtualization support: seamless VM migration
What are the limitations of current layer 2 and layer 3? Compare layer 2 (Ethernet with flat addressing) vs. layer 3 (IP with prefix-based addressing) on: plug-&-play? scalability? small switch state? seamless VM migration?

PortLand Solution
Assumption: a fat-tree network topology for the DC
- Introduces pseudo MAC (PMAC) addresses to balance the pros and cons of flat vs. topology-dependent addressing
- PMACs are topology-dependent, hierarchical addresses, used only as host locators, not host identities
- IP addresses are used as host identities (for compatibility with applications)
- Pros: small switch state and seamless VM migration; eliminates flooding in both the data and control planes
- But requires an IP-to-PMAC mapping and name resolution: a location directory service, plus a location discovery protocol and a fabric manager for plug-&-play support

PMAC Addressing Scheme
- PMAC (48 bits): pod.position.port.vmid
- pod: 16 bits; position and port: 8 bits each; vmid: 16 bits
- Assigned only to servers (end hosts), by their edge switches (see the sketch below)
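A minimal sketch of packing and unpacking the 48-bit PMAC fields described above (field layout as listed; the helper names are my own):

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod(16) . position(8) . port(8) . vmid(16) into a 48-bit PMAC."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Unpack a 48-bit PMAC back into its (pod, position, port, vmid) fields."""
    return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
            (pmac >> 16) & 0xFF, pmac & 0xFFFF)

pmac = encode_pmac(pod=2, position=1, port=3, vmid=7)
print(f"{pmac:012x}")       # 000201030007 -> pod 2, position 1, port 3, vmid 7
print(decode_pmac(pmac))    # (2, 1, 3, 7)
```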

Location Discovery Protocol
- Location Discovery Messages (LDMs) are exchanged between neighboring switches
- LDMs contain the following information: switch identifier, pod number, position, tree level, and up/down direction

PortLand: Name Resolution
- An edge switch listens to end hosts and discovers new source MACs
- It installs <IP, PMAC> mappings and informs the fabric manager
(Figure: the three steps of MAC learning and registration.)

PortLand: Name Resolution (cont.)
1. The edge switch intercepts ARP messages from end hosts
2. It sends a request to the fabric manager, which replies with the PMAC

PortLand: Fabric Manager
- The fabric manager is a logically centralized, multi-homed server
- It maintains the topology and <IP, PMAC> mappings in soft state

Loop-free Forwarding and Fault-Tolerant Routing
- Switches build forwarding tables based on their position (edge, aggregation, or core)
- Strict up-down semantics ensure loop-free forwarding: a packet travels up toward the core and then down toward its destination, never back up again (a quick validity check is sketched below)
- Load balancing: use any equal-cost multipath (ECMP) path, with flow hashing to preserve packet ordering
- Fault-tolerant routing is mostly concerned with detecting failures: the fabric manager maintains a logical fault matrix with per-link connectivity info and informs affected switches, which then re-compute their forwarding tables
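A minimal sketch of the up-down rule (my own illustration of the invariant, not PortLand code): a path is loop-free if, once it starts going down, it never goes up again.

```python
def is_valid_up_down_path(hops):
    """hops is a list of 'up'/'down' link directions along a candidate path.
    Valid paths climb zero or more 'up' hops, then take only 'down' hops;
    an 'up' hop after going down could revisit a switch and is rejected."""
    seen_down = False
    for direction in hops:
        if direction == "down":
            seen_down = True
        elif seen_down:          # an 'up' hop after a 'down' hop -> loop risk
            return False
    return True

print(is_valid_up_down_path(["up", "up", "down", "down"]))   # True
print(is_valid_up_down_path(["up", "down", "up", "down"]))   # False
```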

Fault-Tolerant Routing
1. Upon not receiving an LDM (which also serves as a keepalive in this context) for some configurable period of time, a switch assumes a link failure. (A keepalive is a message sent by one device to another to check that the link between the two is operating, or to prevent the link from being broken.)
2. The detecting switch informs the fabric manager about the failure.

Fault-Tolerant Routing (cont.)
3. The fabric manager maintains a logical fault matrix with per-link connectivity information for the entire topology and updates it with the new information.
4. The fabric manager informs all affected switches of the failure, which then individually recalculate their forwarding tables based on the new version of the topology.

Multicast: Fault Detection and Action
- There is one sender (in pod 1) and three receivers, spread across pods 0 and 1.
- The sender forwards packets to the designated core switch, which in turn distributes the packets to the receivers.
- In step 1, two highlighted links in pod 0 fail simultaneously.

Multicast: Fault Detection and Action (cont.)
- Two aggregation switches detect the failure in step 2 and notify the fabric manager, which in turn updates its fault matrix in step 3.
- The fabric manager calculates forwarding entries for all affected multicast groups in step 4.

BCube: A High-Performance, Server-centric Network Architecture for Modular Data Centers
Main goal: a network architecture for shipping-container-based modular data centers
- BCube construction: a leveled structure; BCube_k is recursively constructed from BCube_(k-1)
- Server-centric: servers perform routing and forwarding
- Considers a variety of communication patterns: one-to-one, one-to-many, one-to-all, all-to-all; single-path and multi-path routing
(Will not be discussed in detail; please read the paper!)

BCube construction (BCube_k, n)
(Figure legend: switches and servers; n = number of servers, k = level.)

(Figure: BCube_1.)

Speedup for One-to-All Traffic
- In one-to-all, a source server delivers a file to all the other servers.
- It is easy to see that under a tree or fat tree, the time for all the receivers to receive a file of size L is at least L.
- In a BCube_k, a source can deliver a file of size L to all the other servers in time L/(k+1), by constructing k+1 edge-disjoint server spanning trees from the k+1 neighbors of the source.
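A small sketch of the arithmetic behind the L/(k+1) claim (my own illustration): splitting the file into k+1 equal chunks and streaming each chunk down a different edge-disjoint spanning tree lets all trees work in parallel.

```python
def one_to_all_time(file_size, k, link_rate=1.0):
    """Completion time when the file is split into k+1 chunks, each sent
    in parallel down one of the k+1 edge-disjoint spanning trees."""
    chunk = file_size / (k + 1)
    return chunk / link_rate          # all chunks finish at the same time

L = 1200.0   # hypothetical file size (in link-rate units)
print(one_to_all_time(L, k=0))   # 1200.0  (single tree: no speedup)
print(one_to_all_time(L, k=3))   # 300.0   (BCube_3: 4x speedup)
```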

One-to-All Traffic Forwarding
- Using spanning trees; speed-up of L/(k+1) for a file of size L
(Figure: two edge-disjoint (server) spanning trees in BCube_1 for one-to-all traffic.)