Advanced Computer Networks. Datacenter Network Fabric


Advanced Computer Networks 263-3501-00. Datacenter Network Fabric. Patrick Stuedi, Spring Semester 2014. Oriana Riva, Department of Computer Science, ETH Zürich. 1

Outline. Last week: supercomputer networking, direct network topologies, RDMA. Today: datacenter networking, scaling Ethernet. 2

Remember: Overview. High-Performance Computing (HPC): supercomputers; systems designed from scratch, densely packed, short rack-to-rack cables; expensive, built from custom high-end components; mostly run a single program at a time, e.g., message passing (MPI) applications. Cloud Computing (today's topic): datacenters built from commodity off-the-shelf hardware; may run multiple jobs at the same time; often multi-tenant - different jobs running in the datacenter have been developed or deployed by different people; often use virtualization - hardware is multiplexed, e.g., multiple virtual machines per host; run cloud workloads - Internet-based applications (email, social networks, maps, search) and analytics (MapReduce/Hadoop, Pregel, NoSQL, NewSQL, etc.). 3

Modern Facebook Datacenter, Prineville, USA 4

Modern Datacenter, Inside view 5

Indirect Network: Common 3-Tier Datacenter Tree 6

Datacenter Networks (2). A rack has ~20-40 servers. Example of a ToR switch with 48 ports. 7

Example Configuration. Data center with 11,520 machines, organized in racks and rows: 24 rows, each row with 12 racks, each rack with 40 blades. Machines in a rack are interconnected with a ToR switch (access layer); each ToR switch has 48 GbE ports and 4 10GbE uplinks. ToR switches connect to End-of-Row (EoR) switches via 1-4 10GbE uplinks (aggregation layer); for fault tolerance a ToR might be connected to EoR switches of different rows. EoR switches are typically 10GbE; to support 12 ToR switches an EoR switch would need 96 ports (4*12*2). Core switch layer: 12 10GbE switches with 96 ports each (24*48 ports). 8
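
As a sanity check on this arithmetic, a minimal Python sketch (illustrative, not from the slides); the factor 2 in the EoR port count is read here as dual-homing each ToR to two EoR switches, as suggested by the fault-tolerance remark above.

```python
# Recompute the example configuration (sketch; see assumptions in the text above).
rows, racks_per_row, blades_per_rack = 24, 12, 40
machines = rows * racks_per_row * blades_per_rack
print(machines)                      # 11520 machines in total

tor_uplinks = 4                      # 4 x 10GbE uplinks per ToR switch
tors_per_row = 12
redundancy = 2                       # assumption: each ToR connects to two EoR switches
eor_ports = tor_uplinks * tors_per_row * redundancy
print(eor_ports)                     # 96 ports needed per EoR switch
```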

Over-subscription. Switches with high port counts are expensive; cost increases non-linearly with the number of ports and the bandwidth per port. Many data center designs introduce oversubscription as a means to lower the total cost of the design. Oversubscription ratio of a switch: ratio of ports facing downwards vs. ports facing upwards (with equal-bandwidth ports). Consequences: switch uplinks get heavily loaded, poor bisection bandwidth. 9

Oversubscription (2). Typical over-subscription ratios: ToR uplinks 4:1 to 20:1, EoR uplinks 4:1. 10
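
To make the definition concrete, a small sketch (illustrative) that computes the ratio of aggregate downward-facing to upward-facing bandwidth; the second call uses made-up numbers to show a more heavily oversubscribed switch.

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of aggregate downward-facing to upward-facing bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# ToR from the example configuration: 48 x 1GbE down, 4 x 10GbE up
print(oversubscription(48, 1, 4, 10))   # 1.2  -> oversubscription 1.2:1

# A hypothetical ToR with only 2 x 10GbE uplinks
print(oversubscription(48, 1, 2, 10))   # 2.4  -> oversubscription 2.4:1
```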

Problem 1: Over-subscription (2) Cost estimate vs. maximum possible number of hosts for different over-subscription ratios (2008) 11

Indirect Network: Fat-Tree Network. Idea: links become "fatter" as one moves up the tree towards the root (example: binary fat tree). By judiciously choosing the fatness, a fat-tree network can provide full bisection bandwidth, i.e., an oversubscription ratio of 1:1. A fat-tree also offers redundant paths (!). Portland: datacenter network architecture proposed by a research group at UC San Diego; fat-tree built from commodity switches. 12

Portland: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. Example: 16-port network built from identical 4-port switches; 4 hosts organized into a 'pod'. Each pod is a 2-ary 2-tree (two links up at each level, height of tree is two). Full bandwidth between hosts directly connected to a pod (4 links into the pod, 4 links within the pod). Full bandwidth between any host pair (16 links at edge-aggregation, 16 links at aggregation-core). 13

Fat-tree Scaling. A k-ary fat-tree has k pods, each containing two layers of k/2 switches. Each k-port switch in the edge layer (access layer) uses k/2 ports to connect to hosts and k/2 ports to connect to aggregation switches. There are (k/2)^2 core switches; port i of any core switch connects to pod i. A fat-tree supports (k^3)/4 hosts; built from 48-port switches this gives 27,648 hosts. Cost: $8.64M as compared to $37M for a traditional topology (2008). The approach scales to 10GbE at the edge (!). 14
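
These scaling formulas are easy to check mechanically; a minimal sketch (illustrative, not from the papers):

```python
def fat_tree(k):
    """Component counts of a k-ary fat-tree built from k-port switches (k even)."""
    pods    = k
    edge_sw = k * (k // 2)       # k/2 edge switches per pod
    agg_sw  = k * (k // 2)       # k/2 aggregation switches per pod
    core_sw = (k // 2) ** 2
    hosts   = k ** 3 // 4        # each edge switch serves k/2 hosts
    return pods, edge_sw, agg_sw, core_sw, hosts

print(fat_tree(4))    # (4, 8, 8, 4, 16)        -- the 16-host Portland example
print(fat_tree(48))   # (48, 1152, 1152, 576, 27648)
```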

Scaling Ethernet in Datacenters. Ethernet is the primary network technology for data centers: high link bandwidth at comparably low cost, and self-configuration (Spanning Tree Protocol). Outside of large data centers, most networks are designed as several modest-sized Ethernets (IP subnets), connected by one or two layers of IP routers. But datacenter operators want Ethernet to scale to an entire datacenter, for several reasons: the use of IP subnets creates significant management complexity (DHCP configuration, etc.), and the use of IP makes VM migration more difficult (requires the VM to change its IP address to match the new subnet). 15

Why Ethernet is hard to scale. The Ethernet spanning tree protocol is not designed for large datacenters: it does not leverage multipath even if available, since spanning tree allows only one path between any src/dst pair, which limits bandwidth and lowers reliability. Packet floods: switches discover hosts and create forwarding entries, must forget table entries periodically to support host mobility, and a switch receiving a packet for an unknown host will flood the packet on all ports. Switch state: can become large if the entire datacenter is one layer-2 network. 16

Switch State: Is it a problem? (Flat vs. hierarchical addressing) Commodity switches (2010) have ~640KB of low-latency, power-hungry, expensive on-chip memory, enough for 32-64K flow entries. Assume 10 million virtual endpoints on 500k servers in a data center. Flat addresses: 10 million address mappings, ~100MB of on-chip memory, ~150 times the memory size that can be put on chip today. Hierarchical addresses: 100-1000 address mappings, ~10KB of memory, easily accommodated by commodity switches. 17
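
A back-of-the-envelope version of this calculation (sketch; the per-entry size is an assumption chosen to roughly reproduce the slide's figures):

```python
endpoints = 10_000_000           # virtual endpoints in the data center
entry_bytes = 10                 # assumed size of one forwarding-table entry

flat_state = endpoints * entry_bytes
print(flat_state / 1e6, "MB")    # ~100 MB, vs. ~640KB of on-chip switch memory

hier_entries = 1_000             # with hierarchical, topology-encoded addresses
print(hier_entries * entry_bytes / 1e3, "KB")   # ~10 KB
```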

Problem 3: Routing vs Switching. Layer 2 (pure switching): plug-and-play +, scalability -, small switch state -, seamless VM migration +. Layer 3 (routing): plug-and-play -, scalability ~, small switch state +, seamless VM migration -. Layer-2 is plug-and-play through the spanning tree protocol; Layer-2 scalability problems: ARP, O(n) routing state, spanning tree, single path; Layer-2 VM migration: the VM can keep its IP address. Layer-3: VM migration between subnets requires changing the IP address (breaks TCP); not plug-and-play (configure subnet numbers, DHCP server, etc.); better scalability through hierarchical routing. 18

Scaling Ethernet in Portland. Portland: research at the Data Center Network Group, UC San Diego. Goals: a single layer-2 network with up to 100K ports and 1M endpoints (through virtualization); VM migration while keeping the IP address; minimize the amount of switch state; towards zero-configuration (no subnets or hierarchical IP addresses, no DHCP, etc.); first-class support for multi-path routing. Uses a fat-tree topology. 19

Scaling Ethernet in Portland: Key principles. Host IP address: node identifier, fixed even after VM migration. Pseudo MAC (PMAC): node location; in-network rewriting of the MAC address; the PMAC changes depending on the location of the host, encodes that location, and is used for routing. Fabric Manager: centralized lookup service; maintains IP->PMAC mappings; replaces ARP; lookup is unicast instead of broadcast. 20

PMAC format. PMAC: pod.position.port.vmid. 21
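
A minimal sketch of encoding and decoding a PMAC. The 16/8/8/16-bit field widths used here are taken from the PortLand paper; everything else (function names, the example values) is illustrative.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod.position.port.vmid into a 48-bit PMAC (16/8/8/16 bits)."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{(value >> shift) & 0xff:02x}" for shift in range(40, -8, -8))

def decode_pmac(pmac):
    value = int(pmac.replace(":", ""), 16)
    return (value >> 32, (value >> 24) & 0xff, (value >> 16) & 0xff, value & 0xffff)

pmac = encode_pmac(pod=2, position=1, port=3, vmid=7)
print(pmac)               # 00:02:01:03:00:07
print(decode_pmac(pmac))  # (2, 1, 3, 7)
```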

Autoconfiguration of PMACs at Switches. Location Discovery Messages (LDMs) are exchanged between neighboring switches; discovery happens at bootup. The LDM protocol helps switches learn their tree level (edge, aggregation, core), pod number, and position number. Configuration does not involve broadcast. 22

Autoconfiguration: Tree level (ES = edge switch, AS = aggregation switch, CS = core switch). I am an edge switch (ES) if I receive LDM messages on my uplinks only (hosts do not send LDMs). Aggregation switches (AS) get messages from edge switches as well as from switches whose level is still unknown. Core switches (CS) get messages on all ports from aggregation switches. 23
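
A simplified sketch of this level-inference logic (illustrative only; the real LDM protocol exchanges more state and runs continuously). A switch classifies itself from which ports have seen LDMs and which level, if any, its neighbors have claimed.

```python
def infer_level(ports):
    """ports maps port -> None (no LDM seen, e.g. a host) or the neighbor's
    claimed level ('edge', 'aggregation', 'core' or 'unknown')."""
    seen = [lvl for lvl in ports.values() if lvl is not None]
    if len(seen) < len(ports):
        return "edge"            # some ports are silent: hosts do not send LDMs
    if "edge" in seen:
        return "aggregation"     # LDMs on all ports, at least one edge neighbor
    if all(lvl == "aggregation" for lvl in seen):
        return "core"            # LDMs from aggregation switches on all ports
    return "unknown"

print(infer_level({1: None, 2: None, 3: "unknown", 4: "unknown"}))      # edge
print(infer_level({1: "edge", 2: "edge", 3: "unknown", 4: "unknown"}))  # aggregation
print(infer_level({1: "aggregation", 2: "aggregation",
                   3: "aggregation", 4: "aggregation"}))                # core
```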

Autoconfiguration: Position number. Run an agreement protocol: each edge switch proposes a random position number; aggregation switches are used to ensure that no two edge switches are assigned the same position number. 24

Autoconfiguration: Pod number. Use the fabric manager's directory service to get the pod number. 25

PMAC Registration. The edge switch communicates the constructed PMACs for its hosts to the fabric manager. 27

Proxy ARP / Routing. Address Resolution Protocol (ARP): edge switches intercept ARP requests and contact the fabric manager; the ARP reply contains the PMAC. Routing: based on the PMAC (which encodes the location of the host); route the packet to an aggregation switch if the destination is in the same pod, route upwards if it is in a different pod. 28
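
A sketch of this forwarding decision in terms of the PMAC fields from the earlier snippet (illustrative; not the actual switch implementation, and the choice among equal-cost 'up' links is left to ECMP, discussed next).

```python
def forward(level, my_pod, my_position, dst):
    """Simplified PMAC-based forwarding decision; dst is the
    (pod, position, port, vmid) tuple returned by decode_pmac()."""
    dst_pod, dst_position, dst_port, _ = dst
    if level == "edge":
        if (dst_pod, dst_position) == (my_pod, my_position):
            return ("deliver on host port", dst_port)
        return ("up", "some aggregation switch")    # ECMP picks which one
    if level == "aggregation":
        if dst_pod == my_pod:
            return ("down", dst_position)           # edge switch in this pod
        return ("up", "some core switch")           # ECMP picks which one
    return ("down", dst_pod)                        # core: port i leads to pod i

print(forward("aggregation", my_pod=2, my_position=0, dst=(2, 1, 3, 7)))  # ('down', 1)
print(forward("aggregation", my_pod=2, my_position=0, dst=(5, 0, 1, 9)))  # ('up', ...)
```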

Portland Loadbalancing. The fat-tree topology has multiple paths at each level; traffic can be spread across different paths for load balancing. Load balancing in Portland uses standard techniques such as equal-cost multi-path routing (ECMP): hash flows/packets to paths (or ports). 29
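
A minimal sketch of the flow-hashing idea behind ECMP (illustrative; real switches hash in hardware with vendor-specific functions): all packets of a flow hash to the same uplink, avoiding reordering, while different flows spread over the equal-cost paths.

```python
import zlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, num_uplinks):
    """Hash a flow's 5-tuple onto one of num_uplinks equal-cost uplinks."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_uplinks

# Same flow -> same uplink; a different flow may take a different uplink.
print(ecmp_uplink("10.0.1.2", "10.0.3.4", 51234, 80, "tcp", num_uplinks=4))
print(ecmp_uplink("10.0.1.2", "10.0.3.4", 51235, 80, "tcp", num_uplinks=4))
```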

VL2: A Scalable and Flexible Data Center Architecture. Alternative datacenter architecture to Portland: Portland is network-centric (intelligence in the switches), VL2 is end-server-centric (intelligence in the servers). Key ideas of VL2: each server has two IP addresses, an application address (AA, location independent) and a location-dependent address (LA); each switch has an LA. A VL2 agent on each server intercepts ARP and maps AAs to LAs. Routing: look up the LA of the switch which serves the destination host and tunnel the application packet to that switch using its LA. VL2 shares concepts with Portland: fat-tree, directory server, ECMP. 30
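
A sketch of the sending-side VL2 agent (illustrative; the directory contents, addresses, and packet representation are invented here): look up the LA of the switch that serves the destination AA and tunnel the application packet to that LA.

```python
# Hypothetical AA -> LA mapping; in VL2 this lookup goes to the directory system.
directory = {"20.0.0.55": "10.1.1.1"}   # AA of destination -> LA of its ToR switch

def vl2_encapsulate(packet, dst_aa):
    """VL2 agent sketch: tunnel an application packet (AA addresses) to the
    LA of the switch serving the destination host."""
    la = directory[dst_aa]                     # replaces ARP; unicast lookup
    return {"outer_dst": la, "inner": packet}  # IP-in-IP style encapsulation

pkt = {"src": "20.0.0.12", "dst": "20.0.0.55", "payload": b"hello"}
print(vl2_encapsulate(pkt, dst_aa="20.0.0.55"))
```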

Loadbalancing in VL2. VL2 uses ECMP and VLB (ECMP alone is not agile enough). Valiant Load Balancing (VLB): spread traffic over a set of intermediate nodes (core switches); when transmitting a packet, the server picks a 'random' core switch and the packet is tunneled from the source to that intermediate switch. How to find a good random core switch? Trick: assign the same LA to all core switches and use ECMP to spread the traffic tunneled to that core-switch LA. 31
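
A sketch of the VLB idea on top of the previous snippet (illustrative; all addresses are made up). Conceptually the sender bounces each packet off a random core switch; VL2 instead gives all core switches the same anycast LA and lets ECMP pick the actual switch.

```python
import random

CORE_LAS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]   # per-switch LAs
ANYCAST_CORE_LA = "10.0.0.100"                                # shared by all cores

def vlb_conceptual(tunneled_packet):
    """Conceptual VLB: explicitly pick a random intermediate core switch."""
    return {"outer_dst": random.choice(CORE_LAS), "inner": tunneled_packet}

def vlb_vl2(tunneled_packet):
    """VL2's trick: encapsulate to the shared anycast LA; ECMP in the
    network then spreads flows over the core switches."""
    return {"outer_dst": ANYCAST_CORE_LA, "inner": tunneled_packet}

inner = {"outer_dst": "10.1.1.1", "inner": {"dst": "20.0.0.55"}}
print(vlb_conceptual(inner))
print(vlb_vl2(inner))
```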

VL2 Example. Each AA has an associated LA (the LA of its ToR switch); the mapping is stored in the VL2 directory. Routing: the server traps the packet and encapsulates it with the LA of the ToR of the destination. Load balancing: the source ToR encapsulates the packet to a randomly chosen core switch. 32

References. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric, SIGCOMM 2009. VL2: A Scalable and Flexible Data Center Network, SIGCOMM 2009. A Guided Tour of Data Center Networking, Dennis Abts and Bob Felderman, ACM Queue, June 2012. 33