Networking The Cloud
Albert Greenberg, Principal Researcher, albert@microsoft.com
(work with James Hamilton, Srikanth Kandula, Dave Maltz, Parveen Patel, Sudipta Sengupta, Changhoon Kim)

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
Albert Greenberg, ICDCS 2009 keynote 2

Data Center Costs
Amortized Cost* | Component            | Sub-components
~45%            | Servers              | CPU, memory, disk
~25%            | Power infrastructure | UPS, cooling, power distribution
~15%            | Power draw           | Electrical utility costs
~15%            | Network              | Switches, links, transit
Total cost varies; upwards of $1/4 B for a mega data center. Server costs dominate; network costs are significant.
The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009. Greenberg, Hamilton, Maltz, Patel.
*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money
Albert Greenberg, ICDCS 2009 keynote 3
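A minimal sketch of the amortization implied by the table's footnote (3-year servers, 15-year infrastructure, 5% annual cost of money), using a standard annuity formula. The capex figures below are invented placeholders; the output will not reproduce the table's percentages, which come from real (undisclosed) numbers. It does show why short-lived servers dominate the amortized budget.

```python
# Hedged sketch: amortized monthly cost per component.
# Only the 3-yr/15-yr terms and the 5% cost of money come from the slide;
# the capex values are hypothetical.

def monthly_cost(capex, years, annual_rate=0.05):
    """Monthly payment that retires `capex` over `years` at `annual_rate` (annuity)."""
    r, n = annual_rate / 12, years * 12
    return capex * r / (1 - (1 + r) ** -n)

# Per dollar of capex, a 3-year asset costs ~3.8x as much per month as a 15-year one.
print(f"{monthly_cost(1.0, 3) / monthly_cost(1.0, 15):.1f}x")

for name, capex, years in [("servers", 110e6, 3),
                           ("power infra", 90e6, 15),
                           ("network", 50e6, 15)]:
    print(f"{name:12s} ${monthly_cost(capex, years) / 1e6:.2f}M / month")
```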

Server Costs
Ugly secret: 30% utilization is considered good in data centers. Causes include:
- Uneven application fit: each server has CPU, memory, and disk, but most applications exhaust one resource, stranding the others
- Long provisioning timescales: new servers are purchased quarterly at best
- Uncertainty in demand: demand for a new service can spike quickly
- Risk management: not having spare servers to meet demand brings failure just when success is at hand
- Session state and storage constraints: if the world were stateless servers, life would be good
4

Goal: Agility
Any service, any server: turn the servers into a single large fungible pool. Let services "breathe": dynamically expand and contract their footprint as needed.
Benefits:
- Increase service developer productivity
- Lower cost
- Achieve high performance and reliability
These are the 3 motivators of most infrastructure projects.
5

Achieving Agility
- Workload management: means for rapidly installing a service's code on a server (virtual machines, disk images)
- Storage management: means for a server to access persistent data (distributed filesystems, e.g., blob stores)
- Network: means for communicating with other servers, regardless of where they are in the data center
6

Network Objectives
Developers want a mental model where all their servers, and only their servers, are plugged into an Ethernet switch.
1. Uniform high capacity: capacity between servers is limited only by their NICs; no need to consider topology when adding servers
2. Performance isolation: traffic of one service should be unaffected by others
3. Layer-2 semantics: flat addressing, so any server can have any IP address; server configuration is the same as in a LAN; legacy applications depending on broadcast must work
7

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
8

The Network of a Modern Data Center
[Figure: conventional data center network; the Internet connects through Layer 3 core routers (CR) and access routers (AR) to Layer 2 switches (S) and load balancers (LB). Ref: Data Center: Load Balancing Data Center Services, Cisco 2004]
Key: CR = L3 Core Router, AR = L3 Access Router, S = L2 Switch, LB = Load Balancer, A = Rack of 20 servers with Top of Rack switch; ~4,000 servers/pod
- Hierarchical network; 1+1 redundancy
- Equipment higher in the hierarchy handles more traffic, is more expensive, and gets more availability effort: a scale-up design
- Servers connect via 1 Gbps UTP to Top of Rack switches
- Other links are a mix of 1G and 10G, fiber and copper
Albert Greenberg, ICDCS 2009 keynote 9

Internal Fragmentation Prevents Applications from Dynamically Growing/Shrinking
[Figure: the same hierarchical topology, with services confined to separate subtrees]
- VLANs are used to isolate properties from each other
- IP addresses are topologically determined by the ARs
- Reconfiguration of IPs and VLAN trunks is painful, error-prone, slow, and often manual
Albert Greenberg, ICDCS 2009 keynote 10

No Performance Isolation
[Figure: the same hierarchical topology, showing collateral damage within a shared subtree]
- VLANs typically provide reachability isolation only
- One service sending/receiving too much traffic hurts all services sharing its subtree
Albert Greenberg, ICDCS 2009 keynote 11

Network Has Limited Server-to-Server Capacity, and Requires Traffic Engineering to Use What It Has
[Figure: the same hierarchical topology, annotated with 10:1 over-subscription or worse (80:1, 240:1) above the ToR layer]
Data centers run two kinds of applications:
- Outward facing (serving web pages to users)
- Internal computation (computing the search index; think HPC)
Albert Greenberg, ICDCS 2009 keynote 12
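To make the over-subscription numbers concrete, here is a minimal sketch of how the end-to-end ratio compounds across layers. The per-layer port counts below are hypothetical illustrations, not figures from the talk; only the idea (server-facing bandwidth divided by uplink bandwidth, multiplied up the tree) is from the slide.

```python
# Hedged sketch: over-subscription compounding in a hierarchical network.
# Port counts are invented for illustration.

def layer_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Over-subscription at one layer: offered downstream bandwidth / uplink bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

tor = layer_ratio(down_ports=40, down_gbps=1, up_ports=2, up_gbps=10)    # 2:1 at the ToR
agg = layer_ratio(down_ports=80, down_gbps=10, up_ports=8, up_gbps=10)   # 10:1 at aggregation
print(f"ToR {tor:.0f}:1, Agg {agg:.0f}:1, end-to-end {tor * agg:.0f}:1")  # 20:1 overall
```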

Network Needs Greater Bisection BW, and Requires Traffic Engineering to Use What It Has
[Figure: the same hierarchical topology]
- Dynamic reassignment of servers and Map/Reduce-style computations mean the traffic matrix is constantly changing
- Explicit traffic engineering is a nightmare
Data centers run two kinds of applications:
- Outward facing (serving web pages to users)
- Internal computation (computing the search index; think HPC)
Albert Greenberg, ICDCS 2009 keynote 13

What Do Data Center Faults Look Like?
- Need very high reliability near the top of the tree; very hard to achieve
- Example: failure of a temporarily unpaired core switch affected ten million users for four hours
- 0.3% of failure events knocked out all members of a network redundancy group
[Figure: the same hierarchical topology. Ref: Data Center: Load Balancing Data Center Services, Cisco 2004]
VL2: A Scalable and Flexible Data Center Network. Sigcomm 2009. Greenberg, Jain, Kandula, Kim, Lahiri, Maltz, Patel, Sengupta.
Albert Greenberg, ICDCS 2009 keynote 14

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
Albert Greenberg, ICDCS 2009 keynote 15

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
  - Building Blocks
  - Traffic
  - New Design
Albert Greenberg, ICDCS 2009 keynote 16

Switch-on-Chip ASICs
[Figure: switch-on-a-chip ASIC floorplan: X-ceivers (serdes), a general-purpose CPU for the control plane, forwarding tables, a forwarding pipeline, and packet buffer memory]
Current design points:
- 24 port 1G + 4 x 10G, 16K IPv4 fwd entries, 2 MB buffer
- 24 port 10G Ethernet, 16K IPv4 fwd entries, 2 MB buffer
Future: 48 port 10G, 16K fwd entries, 4 MB buffer
Trend is towards more ports and faster port speeds.
Albert Greenberg, ICDCS 2009 keynote 17

Packaging a Switch: Combine ASICs
- Silicon fab costs drive ASIC price; market size drives packaged switch price
- Economize links: on chip < on PCB < on chassis < between chassis
- Example: a 144 port 10G switch, built from 24 port switch ASICs in a single chassis
Link technologies:
- SFP+ 10G port: ~$100, MM fiber, 300 m reach
- QSFP (Quad SFP) 40G port, available today: 4 x 10G bound together
- Fiber ribbon cables: up to 72 fibers per cable, to a single MT connector
Albert Greenberg, ICDCS 2009 keynote 18

Latency
- Propagation delay in the data center is essentially 0: light goes a foot in a nanosecond; 1000 ft = 1 usec
- End-to-end latency comes from:
  - Switching latency: 10G to 10G, ~2.5 usec (store & forward); 2 usec (cut-through)
  - Queueing latency: depends on the size of queues and the network load
- Typical times across a quiet data center: 10-20 usec
- Worst-case measurement (from our testbed, not a real DC, with all-to-all traffic pounding and link utilization > 86%): 2-8 ms
Comparison: time across a typical host network stack is 10 usec; application developer SLAs are at > 1 ms granularity.
Albert Greenberg, ICDCS 2009 keynote 19
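A back-of-the-envelope budget using the per-component numbers from the slide. The five-switch path and 600 m of cabling are assumptions for illustration, not measurements.

```python
# Hedged sketch: end-to-end latency across a quiet data center.
# Per-component numbers come from the slide; path length and cable length are assumed.

PROPAGATION_NS_PER_FOOT = 1.0   # light travels roughly a foot per nanosecond
STORE_AND_FWD_US = 2.5          # 10G-to-10G store-and-forward switch hop
HOST_STACK_US = 10.0            # typical host network stack traversal

hops = 5                        # e.g., ToR -> Agg -> Int -> Agg -> ToR (assumed)
fiber_feet = 600 / 0.3048       # ~600 m of cabling (assumed)

total_us = (fiber_feet * PROPAGATION_NS_PER_FOOT / 1000
            + hops * STORE_AND_FWD_US
            + 2 * HOST_STACK_US)          # sender + receiver stacks
print(f"~{total_us:.0f} usec end to end")  # ~34 usec, well under 1 ms SLA granularity
```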

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
  - Building Blocks
  - Traffic
  - New Design
Albert Greenberg, ICDCS 2009 keynote 20

Measuring Traffic in Today's Data Centers
- 80% of the packets stay inside the data center: data mining, index computations, back end to front end
- The trend is towards even more internal communication
Detailed measurement study of a data mining cluster:
- 1,500 servers, 79 Top of Rack (ToR) switches
- Logged: the 5-tuple and size of all socket-level R/W ops
- Aggregated into flows: all activity separated by < 60 s
- Aggregated into traffic matrices every 100 s: Src, Dst, Bytes of data exchanged
Albert Greenberg, ICDCS 2009 keynote 21
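A minimal sketch of the aggregation described above: socket-level records grouped into flows by a 60-second idle gap, then binned into 100-second traffic matrices. The record layout and function names are illustrative assumptions, not the study's actual tooling.

```python
# Hedged sketch: aggregate socket-level R/W records into flows and 100 s traffic matrices.
from collections import defaultdict

IDLE_GAP_S = 60      # a new flow starts after >= 60 s of inactivity
TM_BIN_S = 100       # traffic-matrix bin width

def to_flows(records):
    """records: iterable of (timestamp, src, dst, sport, dport, proto, nbytes)."""
    last_seen, flows = {}, defaultdict(int)
    for ts, src, dst, sport, dport, proto, nbytes in sorted(records):
        key = (src, dst, sport, dport, proto)
        if key in last_seen and ts - last_seen[key] >= IDLE_GAP_S:
            yield key, flows.pop(key)          # emit the finished flow
        last_seen[key] = ts
        flows[key] += nbytes
    yield from flows.items()                   # flush flows still open at the end

def to_traffic_matrices(records):
    """Bin bytes by (100 s time bin, src server, dst server)."""
    tms = defaultdict(lambda: defaultdict(int))
    for ts, src, dst, _sport, _dport, _proto, nbytes in records:
        tms[int(ts // TM_BIN_S)][(src, dst)] += nbytes
    return tms
```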

Flow Characteristics
DC traffic != Internet traffic:
- Most of the flows: various mice
- Most of the bytes: within 100 MB flows
- Median of 10 concurrent flows per server
Albert Greenberg, ICDCS 2009 keynote 22

Traffic Matrix Volatility
- Collapse similar traffic matrices (over 100 sec) into clusters
- Need 50-60 clusters to cover a day's traffic
- The traffic pattern changes nearly constantly
- Run length is 100 s to the 80th percentile; the 99th percentile is 800 s
Albert Greenberg, ICDCS 2009 keynote 23
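One plausible way to perform the clustering mentioned above: treat each 100-second traffic matrix as a flattened vector and greedily group similar ones, then count how many clusters cover the day. This is a sketch under assumptions (cosine distance, a fixed similarity threshold), not the study's exact procedure.

```python
# Hedged sketch: greedy clustering of 100-second traffic matrices.
# The slide only says "collapse similar traffic matrices into clusters";
# the distance metric and threshold here are assumptions.
import numpy as np

def cluster_tms(tms, threshold=0.1):
    """tms: array of shape (num_bins, n_servers * n_servers), one flattened TM per row."""
    centroids, labels = [], []
    for tm in tms:
        v = tm / (np.linalg.norm(tm) + 1e-12)
        dists = [1 - float(v @ c) for c in centroids]   # cosine distance to each cluster
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            centroids.append(v)                          # open a new cluster
            labels.append(len(centroids) - 1)
    return labels, len(centroids)

# 864 bins of 100 s cover a day; the measured cluster count was 50-60.
labels, num_clusters = cluster_tms(np.random.rand(864, 79 * 79))
```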

Today, Computation Constrained by Network*
[Figure: ln(bytes/10 sec) between servers in an operational cluster]
- Great effort is required to place communicating servers under the same ToR
- Most traffic lies on the diagonal (without the log scale, all you see is the diagonal)
- Stripes show there is a need for inter-ToR communication
*Kandula, Sengupta, Greenberg, Patel
Albert Greenberg, ICDCS 2009 keynote 24

Congestion: Hits Hard When It Hits*
*Kandula, Sengupta, Greenberg, Patel
Albert Greenberg, ICDCS 2009 keynote 25

Agenda
- Data Center Costs
- Importance of Agility
- Today's Data Center Network
- A Better Way?
  - Building Blocks
  - Traffic
  - New Design
References:
- VL2: A Scalable and Flexible Data Center Network. Sigcomm 2009. Greenberg, Hamilton, Jain, Kandula, Kim, Lahiri, Maltz, Patel, Sengupta.
- Towards a Next Generation Data Center Architecture: Scalability and Commoditization. Presto 2009. Greenberg, Maltz, Patel, Sengupta, Lahiri.
- PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. Mysore, Pamboris, Farrington, Huang, Miri, Radhakrishnan, Subramanya, Vahdat.
- BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. Guo, Lu, Li, Wu, Zhang, Shi, Tian, Zhang, Lu.
Albert Greenberg, ICDCS 2009 keynote 26

VL2: Distinguishing Design Principles
- Randomizing to cope with volatility: tremendous variability in traffic matrices
- Separating names from locations: any server, any service
- Embracing end systems: leverage the programmability and resources of servers; avoid changes to switches
- Building on proven networking technology: we can build with parts shipping today; leverage low-cost, powerful merchant silicon ASICs, though do not rely on any one vendor
Albert Greenberg, ICDCS 2009 keynote 27

What Enables a New Solution Now?
Programmable switches with high port density:
- Fast: ASIC switches-on-a-chip (Broadcom, Fulcrum, ...)
- Cheap: small buffers, small forwarding tables
- Flexible: programmable control planes
- Example: a 20 port 10GE switch, list price $10K
Centralized coordination:
- Scale-out data centers are not like enterprise networks
- Centralized services already control/monitor the health and role of each server (Autopilot)
- A centralized directory and control plane are acceptable (4D)
Albert Greenberg, ICDCS 2009 keynote 28

An Example VL2 Topology: Clos Network
[Figure: D/2 intermediate node switches (used in VLB) with D ports each; D aggregation switches, each with D/2 ports up and D/2 ports down, all 10G; 20-port Top of Rack switches; [D^2/4] * 20 servers]
Node degree (D) of available switches and number of servers supported:
D   | # Servers in pool
4   | 80
24  | 2,880
48  | 11,520
144 | 103,680
A scale-out design with broad layers:
- Same bisection capacity at each layer: no oversubscription
- Extensive path diversity
- Graceful degradation under failure
Albert Greenberg, ICDCS 2009 keynote 29
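The table follows directly from the topology: D-port switches give D^2/4 ToRs, each hosting 20 servers. A small sketch reproducing the numbers:

```python
# Servers supported by the example Clos topology: [D^2 / 4] ToRs x 20 servers each,
# where D is the port count (node degree) of the aggregation/intermediate switches.
def servers_in_pool(d, servers_per_tor=20):
    return (d * d // 4) * servers_per_tor

for d in (4, 24, 48, 144):
    print(f"D = {d:3d}  ->  {servers_in_pool(d):,} servers")
# D =   4  ->  80 servers
# D =  24  ->  2,880 servers
# D =  48  ->  11,520 servers
# D = 144  ->  103,680 servers
```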

Use Randomization to Cope with Volatility
[Figure: the same Clos topology and server-count table as the previous slide]
Valiant Load Balancing:
- Every flow is bounced off a random intermediate switch
- Provably hotspot-free for any admissible traffic matrix
- Servers could randomize flow-lets if needed
Albert Greenberg, ICDCS 2009 keynote 30
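A minimal sketch of the per-flow randomization idea: hash the flow's 5-tuple to pick one intermediate switch, so all packets of a flow take the same bounce path while flows spread uniformly. The function and field names are illustrative assumptions, not VL2's implementation.

```python
# Hedged sketch of Valiant Load Balancing's per-flow choice: each flow is
# deterministically hashed to one (random-looking) intermediate switch.
import hashlib

def pick_intermediate(flow_5tuple, intermediate_switches):
    """flow_5tuple = (src_ip, dst_ip, src_port, dst_port, proto)."""
    digest = hashlib.sha1(repr(flow_5tuple).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(intermediate_switches)
    return intermediate_switches[index]

intermediates = [f"int-{i}" for i in range(72)]        # D/2 = 72 for D = 144
bounce = pick_intermediate(("10.0.0.1", "10.0.7.9", 4321, 80, 6), intermediates)
```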

Separating Names from Locations: How Smart Servers Use Dumb Switches
[Figure: a packet from Source (S) to Dest (D) traverses the source ToR (TS), an Intermediate Node (N), and the destination ToR (TD); the sending server wraps the payload in nested headers (Dest: N, Dest: TD, Dest: D, Src: S), which are stripped hop by hop]
- Encapsulation is used to transfer complexity to servers: commodity switches have simple forwarding primitives, and the complexity is moved to computing the headers
- Many types of encapsulation are available: IEEE 802.1ah defines MAC-in-MAC encapsulation; VLANs; etc.
Albert Greenberg, ICDCS 2009 keynote 31
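A minimal sketch of the header computation described above: look up the destination's current ToR in a directory, pick a random intermediate switch, and wrap the packet in nested headers that the switches forward on and strip. The directory API and header representation are assumptions for illustration, not the VL2 wire format.

```python
# Hedged sketch: a sending server computes nested encapsulation headers
# (Dest: N, Dest: TD, Dest: D) from a directory lookup.
import random
from dataclasses import dataclass

@dataclass
class EncapsulatedPacket:
    outer_dst: str      # intermediate switch N (the VLB bounce point)
    mid_dst: str        # destination's ToR, TD
    inner_dst: str      # destination server D (its name/address)
    src: str
    payload: bytes

def encapsulate(src, dst, payload, directory, intermediates):
    """directory: maps a server's name/IP to the ToR it currently sits behind."""
    return EncapsulatedPacket(
        outer_dst=random.choice(intermediates),   # bounce off a random N
        mid_dst=directory[dst],                   # current location (ToR) of D
        inner_dst=dst,
        src=src,
        payload=payload,
    )

pkt = encapsulate("10.0.0.1", "10.0.7.9", b"hello",
                  directory={"10.0.7.9": "tor-17"},
                  intermediates=["int-0", "int-1", "int-2"])
```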

Embracing End Systems
[Figure: a server machine running a VL2 agent: the server role in user space; in the kernel, TCP, IP, ARP, an encapsulator, and the NIC/MAC, with a resolution cache that resolves remote IPs against the data center's directory system, which also tracks server health and network health]
- Data center OSes are already heavily modified for VMs, storage clouds, etc.: a thin shim for network support is no big deal
- No change to applications or to clients outside the DC
Albert Greenberg, ICDCS 2009 keynote 32

VL2 Prototype
- 4 ToR switches, 3 aggregation switches, 3 intermediate switches
- Experiments conducted with both 40 and 80 servers
Albert Greenberg, ICDCS 2009 keynote 33

VL2 Achieves Uniform High Throughput
Experiment: an all-to-all shuffle of 500 MB among 75 servers (2.7 TB in total)
- An excellent metric of overall efficiency and performance: the all-to-all shuffle is a superset of other traffic patterns
Results: average goodput 58.6 Gbps; fairness index 0.995; average link utilization 86%
- Perfect system-wide efficiency would yield an aggregate goodput of 75 Gbps, so VL2's efficiency is 78% of perfect
- 10% of the inefficiency is due to duplexing issues and 7% to header overhead, so VL2's efficiency is 94% of the achievable optimum
Albert Greenberg, ICDCS 2009 keynote 34
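The efficiency figures follow from simple arithmetic. The 75 Gbps baseline corresponds to each of the 75 servers being limited by its NIC line rate (an assumption consistent with the prototype), and the 10% duplexing loss and 7% header overhead are treated as deductions from that baseline:

\[
\frac{58.6\ \text{Gbps}}{75\ \text{Gbps}} \approx 0.78,
\qquad
\frac{58.6}{75\,(1 - 0.10 - 0.07)} = \frac{58.6}{62.25} \approx 0.94 .
\]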

VL2 Provides Performance Isolation
- Service 1 is unaffected by service 2's activity
Albert Greenberg, ICDCS 2009 keynote 35

VL2 Is Resilient to Link Failures
- Performance degrades and recovers gracefully as links are failed and restored
Albert Greenberg, ICDCS 2009 keynote 36

Summary
[Figure: the Clos topology diagram and the amortized-cost table from earlier slides]
It's about agility: increase data center capacity and run any service on any server, anywhere in the data center. VL2 enables agility and ends oversubscription, and the results show near-perfect scaling.
- Increases service developer productivity: a simpler abstraction -- all servers plugged into one huge Ethernet switch
- Lowers cost: high-scale server pooling with tenant performance isolation, built from commodity networking components
- Achieves high performance and reliability: the prototype behaves as theory predicts, giving us some confidence that the design will scale out
Albert Greenberg, ICDCS 2009 keynote 37

Thank You
Albert Greenberg, ICDCS 2009 keynote 38

Lay of the Land
[Figure: the Internet connecting users, CDNs/ECNs, and data centers]
Albert Greenberg, ICDCS 2009 keynote 39

Problem Scope
- 200+ online properties; in excess of one PB a day to/from the Internet
Cost and performance:
- Cost: ~$50M spend in FY08
- Performance [KDD 07; Kohavi]: Amazon: 1% sales loss for an extra 100 ms of delay; Google: 20% sales loss for an extra 500 ms of delay
- Apps are chatty: N RTTs can quickly get to 100 ms and beyond
How to manage for the sweet spot of the cost/performance tradeoff? A hard problem: 200 properties x 18 data centers x 250K prefixes x 10 alternate paths, roughly 9 billion combinations.
Albert Greenberg, ICDCS 2009 keynote 40

Perspective
Good news:
- Extremely well connected to the Internet (1,200 ASes!)
- 18+ data centers (mega and micro) around the world
Bad news:
- BGP (Border Gateway Protocol) picks one default path, oblivious to cost and performance
- What's best depends on the property, the data center, the egress options, and the user
- Humans (GNS engineers) face an increasingly difficult task
Albert Greenberg, ICDCS 2009 keynote 41

Optimization Sweet Spot, Trading Off Performance and Cost
Albert Greenberg, ICDCS 2009 keynote 42

Solution
- Produce the ENTACT curve for TE strategies
- Automatically choose optimal operating points under certain high-level objectives
[Figure: cost/performance curve marking Default Routing and the Sweet Spot]
Albert Greenberg, ICDCS 2009 keynote 43

Conventional Networking Equipment
Modular routers:
- Chassis: $20K; supervisor card: $18K
- Ports: 8 port 10G-X, $25K, 1 GB buffer memory
- Max ~120 ports of 10G per switch
- Total price in common configurations: $150-200K (+ SW & maintenance); power: ~2-5 KW
Integrated Top of Rack switches:
- 48 port 1GBase-T plus 2-4 ports of 1 or 10G-X: $7K
Load balancers:
- Spread TCP connections over servers
- $50-$75K each; used in pairs
Albert Greenberg, ICDCS 2009 keynote 44

VLB vs. Adaptive vs. Best Oblivious Routing
- VLB does as well as adaptive routing (traffic engineering using an oracle) on data center traffic
- The worst link is 20% busier with VLB; the median is the same
Albert Greenberg, ICDCS 2009 keynote 45