15-744 Computer Networking: Routing in the DC




15-744: Computer Networking
L-24: Data Center Networking

Overview
- Data Center Overview
- Routing in the DC
- Transport in the DC

Datacenter Arms Race
(Photo: Google Oregon datacenter)
- Amazon, Google, Microsoft, Yahoo!, and others race to build next-gen mega-datacenters
  - Industrial-scale Information Technology
  - 100,000+ servers
  - Located where land, water, fiber-optic connectivity, and cheap power are available
- E.g., Microsoft Quincy: 43,600 sq. ft. (10 football fields), sized for 48 MW
  - Also Chicago, San Antonio, Dublin, at ~$500M each
- E.g., Google: The Dalles OR, Pryor OK, Council Bluffs IA, Lenoir NC, Goose Creek SC

Computers + Net + Storage + Power + Cooling
(figure-only slide)

Energy Proportional Computing
("The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007)
- It is surprisingly hard to achieve high levels of utilization on typical servers (and your home PC or laptop is even worse)
- Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum.

Energy Proportional Computing: Doing Nothing Well... NOT!
("The Case for Energy-Proportional Computing," Barroso and Hölzle, IEEE Computer, December 2007)
- Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.

Energy Proportional Computing: Doing Nothing VERY Well
- Design for a wide dynamic power range and active low-power modes
- Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.
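To make the efficiency argument above concrete, here is a small illustrative sketch (not from the slides): it models a conventional server that idles at about half of peak power against an idealized, more energy-proportional one, and computes energy efficiency as useful work per unit of normalized power. The linear power curves are assumptions chosen only to match the qualitative shape of the figures.

```python
# Illustrative model of energy efficiency vs. utilization (assumed linear power curves).

def conventional_power(u):
    """Normalized power: ~50% of peak when idle, 100% at full load."""
    return 0.5 + 0.5 * u

def proportional_power(u):
    """More energy-proportional server: ~10% of peak when idle."""
    return 0.1 + 0.9 * u

def efficiency(u, power_model):
    """Useful work delivered per unit of normalized power."""
    return u / power_model(u)

for u in (0.1, 0.3, 0.5, 1.0):
    print(f"util {u:4.0%}: conventional {efficiency(u, conventional_power):.2f}, "
          f"proportional {efficiency(u, proportional_power):.2f}")
```

At 30 percent utilization the proportional model already delivers over 80 percent of its peak efficiency, which is the behavior Figure 4 describes; the conventional model is below half.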

DC Networking and Power
(Figure: thermal image of a typical cluster; the rack switch is the hot spot)
- Within DC racks, network equipment is often the hottest component in the hot spot
- Network opportunities for power reduction:
  - Transition to higher-speed interconnects (10 Gbps) at DC scales and densities
  - High-function/high-power assists embedded in network elements (e.g., TCAMs)
(M. K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: an end-to-end evaluation of datacenter efficiency," Intel Corporation)

DC Networking and Power
- A 96 x 1 Gbit port Cisco datacenter switch consumes around 15 kW, approximately 100x a typical dual-processor Google server at 145 W
- High port density drives network element design, but such high power density makes it difficult to tightly pack switches with servers
- Alternative distributed processing/communications topologies are under investigation by various research groups
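A quick sanity check of the switch-to-server power ratio quoted above (the 15 kW and 145 W figures are reconstructed from the garbled transcription, so treat them as approximate):

```python
switch_power_w = 15_000   # 96 x 1 Gbit port datacenter switch (approximate)
server_power_w = 145      # typical dual-processor server (approximate)
print(f"{switch_power_w / server_power_w:.0f}x")   # ~103x, i.e., roughly 100x
```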

Containerized Datacenters
(figure-only slide)

Containerized Datacenters
- Sun Modular Data Center
- Power/cooling for 200 kW of racked hardware
- External taps for electricity, network, and water
- 7.5 racks: ~250 servers, 7 TB DRAM, 1.5 PB disk

Summary
- Energy consumption in IT equipment
  - Energy-proportional computing
  - Inherent inefficiencies in electrical energy distribution
- Energy consumption in Internet datacenters
  - Backend to billions of network-capable devices
  - Enormous processing, storage, and bandwidth supporting applications for huge user communities
  - Resource management (processor, memory, I/O, network) to maximize performance subject to power constraints: "Do Nothing Well"
  - New packaging opportunities for better co-optimization of computing + communication + power + mechanical design

Overview
- Data Center Overview
- Routing in the DC
- Transport in the DC

Layer 2 vs. Layer 3 for Data Centers

Flat vs. Location-Based Addresses
- Commodity switches today have ~640 KB of low-latency, power-hungry, expensive on-chip memory, storing 32-64 K flow entries
- Assume 10 million virtual endpoints across 500,000 servers in a datacenter
- Flat addresses: 10 million address mappings, ~100 MB of on-chip memory, roughly 150 times the memory that can be put on chip today
- Location-based addresses: 100-1000 address mappings, ~10 KB of memory, easily accommodated in today's switches
(A back-of-the-envelope check of these numbers follows below.)

PortLand: Main Assumption
- Hierarchical structure of data center networks: they are multi-level, multi-rooted trees

Data Center Network
(figure-only slide: multi-rooted tree topology)
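A back-of-the-envelope check of the flat vs. location-based address numbers above; the ~10 bytes per forwarding entry is an assumption for illustration, not a figure from the slides:

```python
endpoints       = 10_000_000     # virtual endpoints assumed in the slide
bytes_per_entry = 10             # assumed size of one address mapping (illustrative)
on_chip_bytes   = 640 * 1024     # ~640 KB of on-chip memory in a commodity switch

flat_table_bytes = endpoints * bytes_per_entry
print(flat_table_bytes / 1e6)               # ~100 MB needed for flat addresses
print(flat_table_bytes / on_chip_bytes)     # ~150x what fits on chip today
print(1000 * bytes_per_entry / 1e3)         # ~10 KB for ~1000 location-based entries
```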

Hierarchical Addresses
(Sequence of figure-only slides illustrating hierarchical, topology-derived address assignment on the tree)
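The slides here are figure-only; as a concrete illustration of what hierarchical addresses buy you, below is a minimal sketch of PortLand-style pseudo-MAC (PMAC) encoding. The pod.position.port.vmid field layout follows the PortLand paper; the helper names are ours.

```python
# Sketch of PortLand-style hierarchical pseudo-MACs (PMACs).
# Field layout per the PortLand paper: 16-bit pod, 8-bit position, 8-bit port, 16-bit vmid.

def encode_pmac(pod, position, port, vmid):
    """Pack the topology-derived location fields into a 48-bit pseudo-MAC."""
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Recover (pod, position, port, vmid) from a 48-bit pseudo-MAC."""
    return (pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF

pmac = encode_pmac(pod=2, position=1, port=3, vmid=7)
print(f"{pmac:012x}", decode_pmac(pmac))   # 000201030007 (2, 1, 3, 7)

# Because the pod number is a prefix of the PMAC, a core switch needs only one
# forwarding entry per pod instead of one per host, which keeps switch state small.
```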

PortLand: Location Discovery Protocol
- Location Discovery Messages (LDMs) are exchanged between neighboring switches
- Switches self-discover their location on boot-up
(Sequence of figure-only slides stepping through the location discovery protocol on the fat tree)
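The protocol itself appears only in figures here. As a heavily simplified sketch of the idea (level inference only; real LDMs also carry pod number, position, and up/down direction, and the fabric manager assigns pod numbers), a switch can deduce its tree level from which neighbors send LDMs, since hosts never do:

```python
# Simplified sketch of PortLand-style location discovery (level inference only).

def infer_level(ports_with_ldm, total_ports, neighbor_levels):
    """
    ports_with_ldm:  set of ports on which LDMs have been heard
    total_ports:     number of ports on this switch
    neighbor_levels: {port: level} for neighbors that already claim a level
    """
    # Hosts never send LDMs, so hearing LDMs on only a subset of ports
    # means the silent ports face hosts: this is an edge switch.
    if len(ports_with_ldm) < total_ports:
        return "edge"
    # A switch with edge switches among its neighbors is an aggregation switch.
    if "edge" in neighbor_levels.values():
        return "aggregation"
    # A switch that hears only aggregation switches on every port is a core switch.
    if neighbor_levels and all(lvl == "aggregation" for lvl in neighbor_levels.values()):
        return "core"
    return "unknown"  # not enough information yet; keep exchanging LDMs

print(infer_level({0, 1}, 4, {}))                                         # edge
print(infer_level({0, 1, 2, 3}, 4, {0: "edge", 1: "edge"}))               # aggregation
print(infer_level({0, 1, 2, 3}, 4, {p: "aggregation" for p in range(4)})) # core
```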

Name Resolution
(Sequence of figure-only slides stepping through name resolution)

Fabric Manager
(figure-only slide)
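These name-resolution slides are figure-only in the transcription. As a hedged sketch of the mechanism the PortLand paper describes, an edge switch intercepts ARP requests and asks the fabric manager for the IP-to-PMAC mapping, falling back to a broadcast only on a miss; the class and method names below are illustrative.

```python
# Sketch of PortLand-style ARP resolution through a central fabric manager.

class FabricManager:
    def __init__(self):
        self.ip_to_pmac = {}              # populated as edge switches register hosts

    def register(self, ip, pmac):
        self.ip_to_pmac[ip] = pmac

    def resolve(self, ip):
        return self.ip_to_pmac.get(ip)    # None on a miss

class EdgeSwitch:
    def __init__(self, fabric_manager):
        self.fm = fabric_manager

    def handle_arp_request(self, target_ip):
        pmac = self.fm.resolve(target_ip)
        if pmac is not None:
            return f"ARP reply: {target_ip} is-at PMAC {pmac:012x}"
        # Miss: fall back to a fat-tree broadcast (rare in steady state).
        return f"fallback: broadcast ARP for {target_ip}"

fm = FabricManager()
fm.register("10.2.1.3", 0x000201030007)   # host registered with its PMAC
sw = EdgeSwitch(fm)
print(sw.handle_arp_request("10.2.1.3"))
print(sw.handle_arp_request("10.9.9.9"))
```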

Other Schemes
- SEATTLE [SIGCOMM 2008]:
  - A layer 2 network fabric that works at enterprise scale
  - Eliminates ARP broadcast; proposes a one-hop DHT
  - Eliminates flooding; uses broadcast-based link-state routing
  - Scalability limited by the broadcast-based routing protocol and large switch state
- VL2 [SIGCOMM 2009]:
  - A network architecture that scales to support huge data centers
  - A layer 3 routing fabric is used to implement a virtual layer 2
  - Scales layer 2 via end-host modifications
  - Unmodified switch hardware and software
  - End hosts are modified to perform enhanced resolution to assist routing and forwarding

VL2: Name-Location Separation
- Copes with host churn with very little overhead
- Switches run link-state routing and maintain only switch-level topology; a directory service stores the name-to-location (server-to-ToR) mappings and answers lookups
- Servers use flat names
- Benefits: allows use of low-cost switches; protects the network and hosts from host-state churn; obviates host and switch reconfiguration
(Figure: directory service lookup and response; ToR switches forwarding encapsulated payloads)

VL2: Random Indirection [ECMP + IP Anycast]
- Copes with arbitrary traffic matrices with very little overhead
- Obviates esoteric traffic engineering or optimization
- Ensures robustness to failures
- Works with switch mechanisms available today
- Harnesses the huge bisection bandwidth
(Figure: packets addressed to an intermediate anycast address before the destination ToR; separate links used for up paths and down paths)
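To make the two VL2 ideas above concrete, here is a small sketch (names and data structures are illustrative, not from the paper): the host agent resolves a flat application address (AA) to the destination ToR's locator address via the directory service, then tunnels the packet through a randomly chosen intermediate switch, which is how the design spreads arbitrary traffic matrices over the bisection bandwidth.

```python
# Sketch of VL2-style sending: directory lookup plus random indirection.
# Real VL2 uses IP-in-IP encapsulation and an IP anycast address for the
# intermediate switches; this just shows the control logic.
import random

directory = {"10.1.0.5": "ToR-3", "10.1.0.9": "ToR-4"}   # AA -> destination ToR
intermediate_switches = ["Int-1", "Int-2", "Int-3"]      # intermediate/core layer

def send(src_aa, dst_aa, payload):
    dst_tor = directory[dst_aa]                          # lookup by the host agent
    bounce  = random.choice(intermediate_switches)       # per-flow random choice
    # Outer headers: first to the intermediate switch, then to the destination
    # ToR, which decapsulates and delivers to dst_aa.
    return {"outer": [bounce, dst_tor], "inner": (src_aa, dst_aa), "payload": payload}

print(send("10.1.0.5", "10.1.0.9", b"hello"))
```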

Overview
- Data Center Overview
- Routing in the DC
- Transport in the DC

Cluster-based Storage Systems
- Synchronized read: a client requests a data block that is striped across multiple storage servers; each server returns its Server Request Unit (SRU) through a shared switch
- The client sends the next batch of requests only after the current data block is complete
(Figure: client, switch, and storage servers performing a synchronized read)
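The synchronized-read pattern above is what sets up incast: the client cannot make progress until every server's SRU has arrived, so one slow or lost response stalls the whole block. A minimal sketch of that barrier behavior (names and the toy fetch function are illustrative):

```python
# Barrier behavior of a synchronized read: the next batch of requests is not
# issued until every Server Request Unit (SRU) of the current block arrives.

def synchronized_read(num_blocks, servers, fetch_sru):
    blocks = []
    for block_id in range(num_blocks):
        # In a real system the requests go out in parallel; the key point is
        # that *all* SRUs must arrive before the next block is requested.
        srus = [fetch_sru(server, block_id) for server in servers]
        blocks.append(b"".join(srus))      # block complete only when all SRUs are in
    return blocks

def fetch_sru(server, block_id):
    """Toy stand-in for a storage server returning its SRU."""
    return f"{server}:{block_id};".encode()

print(synchronized_read(2, ["s1", "s2", "s3", "s4"], fetch_sru))
```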

TCP Throughput Collapse
- Cluster setup: 1 Gbps Ethernet, unmodified TCP, S50 switch, 1 MB block size
(Figure: goodput collapses as the number of servers grows)

TCP: Loss Recovery Comparison
- Timeout-driven recovery is slow (milliseconds)
- Data-driven recovery (retransmit on duplicate ACKs) is super fast (microseconds) in datacenters
(Figure: sender/receiver timelines for timeout-driven vs. data-driven recovery)

TCP Incast
- Cause of throughput collapse: coarse-grained TCP timeouts (the retransmission timeout, RTO)

Link Idle Time Due to Timeouts
- In a synchronized read, if one server's response is dropped, the client's link sits idle until the ~200 ms retransmission timeout fires and the response is resent
(Figure: requests sent, one response dropped, link idle, response resent; client link utilization over time)
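A rough back-of-the-envelope illustration of why a 200 ms retransmission timeout is so damaging in this setting; only the 1 Gbps link, 1 MB block, and 200 ms RTO come from the slides, the rest is arithmetic:

```python
# Goodput of one synchronized read with and without an RTO stall.
link_bps   = 1e9          # 1 Gbps client link
block_bits = 1e6 * 8      # 1 MB data block striped across the servers
rto_s      = 0.200        # default TCP minRTO (200 ms)

ideal_s   = block_bits / link_bps      # ~8 ms to receive the whole block
stalled_s = ideal_s + rto_s            # one lost SRU idles the link for a full RTO

print(f"ideal goodput   ~{block_bits / ideal_s   / 1e6:.0f} Mbps")   # ~1000 Mbps
print(f"stalled goodput ~{block_bits / stalled_s / 1e6:.0f} Mbps")   # ~38 Mbps
```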

Default minRTO: Throughput Collapse
(Figure: unmodified TCP with a 200 ms minRTO collapses as servers are added)

Lowering minRTO to 1 ms Helps
(Figure: 1 ms minRTO vs. unmodified TCP with 200 ms minRTO)
- Millisecond retransmissions are not enough

Solution: Microsecond TCP + No minRTO
- High throughput for up to 47 servers
(Figure: microsecond TCP + no minRTO vs. 1 ms minRTO vs. unmodified TCP with 200 ms minRTO)

Simulation: Scaling to Thousands
- Block size = 80 MB, buffer = 32 KB, RTT = 20 µs
(Figure: goodput as more servers are added, for microsecond TCP + no minRTO, 1 ms minRTO, and unmodified TCP with 200 ms minRTO)
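The reason milliseconds are still too coarse is the RTO floor: with the standard estimator the computed timeout for a ~100 µs datacenter RTT would be well under a millisecond, but minRTO overrides it. A small sketch (the RTT statistics are illustrative):

```python
# RFC 6298-style RTO with a configurable floor: RTO = max(minRTO, SRTT + 4*RTTVAR).

def rto(srtt_s, rttvar_s, min_rto_s):
    return max(min_rto_s, srtt_s + 4 * rttvar_s)

srtt, rttvar = 100e-6, 20e-6   # illustrative datacenter RTT statistics

print(rto(srtt, rttvar, min_rto_s=0.200) * 1e3, "ms")   # 200 ms: the floor dominates
print(rto(srtt, rttvar, min_rto_s=0.001) * 1e3, "ms")   # 1 ms: still >> the RTT
print(rto(srtt, rttvar, min_rto_s=0.0)   * 1e6, "us")   # ~180 us: microsecond-scale
```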

Delayed ACK (for RTO > 40 ms)
- Delayed ACK is an optimization to reduce the number of ACKs sent: the receiver may hold its ACK for up to ~40 ms waiting for a second segment
(Figure: sender/receiver timelines with and without delayed ACKs)

Microsecond RTO and Delayed ACK
- RTO > 40 ms: the delayed ACK arrives before the sender's timeout, so behavior is unchanged
- RTO < 40 ms: premature timeout; the RTO on the sender triggers before the delayed ACK on the receiver, causing an unnecessary retransmission
(Figure: sender times out and retransmits while the receiver is still delaying its ACK)

Impact of Delayed ACK
(figure-only slide)

Is It Safe for the Wide Area?
- Stability: could we cause congestion collapse?
  - No: wide-area RTOs are in the 10s or 100s of ms
  - No: timeouts result in rediscovering link capacity (they slow down the rate of transfer)
- Performance: do we time out unnecessarily? [Allman99]
  - Reducing minRTO increases the chance of premature timeouts
  - Premature timeouts slow the transfer rate
  - Today: detect and recover from premature timeouts
  - Wide-area experiments to determine the performance impact
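The premature-timeout condition above boils down to a simple comparison: with one outstanding segment, the receiver may hold its ACK for up to the delayed-ACK timer (about 40 ms on Linux), so any sender RTO below that can fire spuriously. A tiny illustrative check:

```python
# Spurious-retransmission check: sender RTO vs. the receiver's delayed-ACK timer.
DELAYED_ACK_S = 0.040   # ~40 ms delayed-ACK timer

def premature_timeout(rto_s, delayed_ack_s=DELAYED_ACK_S):
    """True if the RTO can fire before a delayed ACK would have been sent."""
    return rto_s < delayed_ack_s

for rto in (0.200, 0.001, 200e-6):
    print(f"RTO = {rto * 1e3:g} ms -> premature timeout possible: {premature_timeout(rto)}")
```

In practice this interaction is why the receiver's delayed-ACK setting matters once the RTO is reduced below tens of milliseconds.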

Wide-Area Experiment
- BitTorrent seeds and clients; microsecond TCP + no minRTO vs. standard TCP
- Do microsecond timeouts harm wide-area throughput?

Wide-Area Experiment: Results
- No noticeable difference in throughput

Other Efforts
- Topology
  - Using extra links to meet the traffic matrix
  - 60 GHz links (MSR paper in HotNets 2009)
  - Reconfigurable optical interconnects (CMU and UCSD papers in SIGCOMM 2010)
- Transport
  - Data Center TCP: a data-center-only protocol that uses RED-like techniques in routers

Next Lecture: Topology
- Required reading:
  - On Power-Law Relationships of the Internet Topology
  - A First-Principles Approach to Understanding the Internet's Router-level Topology
- Optional reading:
  - Measuring ISP Topologies with Rocketfuel

Aside: Disk Power
- IBM Microdrive (1 inch):
  - writing: 300 mA (3.3 V), ~1 W
  - standby: 65 mA (3.3 V), ~0.2 W
- IBM TravelStar (2.5 inch):
  - read/write 2 W; spinning 1.8 W; low-power idle 0.65 W; standby 0.25 W; sleep 0.1 W; startup 4.7 W; seek 2.3 W

Spin-down Disk Model
- States and approximate power: Not Spinning (0.2 W); Spinning up (4.7 W, triggered by a request or a prediction); Spinning & Ready (0.65-1.8 W); Spinning & Access (2 W); Spinning & Seek (2.3 W); Spinning down (after the inactivity timeout threshold, or predictively)
(Figure: disk power state machine)

Disk Spindown: Spin-Down Policies
- Disk power management, oracle (off-line): spin down whenever IdleTime > BreakEvenTime
- Disk power management, practical scheme (on-line): spin down after the disk has been idle for BreakEvenTime (the wait time)
- Fixed thresholds: choose the timeout T_out from the spin-down cost such that 2 * E_transition = P_spin * T_out
- Adaptive thresholds: T_out = f(recent accesses); exploit burstiness in T_idle
- Minimizing "bumps" (user annoyance/latency):
  - Predictive spin-ups
  - Changing access patterns (making burstiness): caching, prefetching
(Source: the presentation slides of the authors)
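A worked example of the fixed-threshold rule above, using the TravelStar's spinning power; the transition energy is an assumed value for illustration, not a figure from the slides:

```python
# Fixed-threshold spin-down: pick T_out so that 2 * E_transition = P_spin * T_out,
# i.e. the energy spent waiting equals (twice) the cost of a spin-down/spin-up cycle.
P_spin       = 1.8    # W, power while spinning idle (TravelStar "spinning" state)
E_transition = 9.0    # J, assumed energy of one spin-down + spin-up cycle (illustrative)

T_out = 2 * E_transition / P_spin
print(f"spin down after ~{T_out:.0f} s of idleness")   # ~10 s
```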

Google
- Since 2005, its data centers have been composed of standard shipping containers, each with 1,160 servers and a power consumption that can reach 250 kilowatts
- The Google server was 3.5 inches thick: 2U, or 2 rack units, in data center parlance. It had two processors, two hard drives, and eight memory slots mounted on a motherboard built by Gigabyte

Google's PUE
- In the third quarter of 2008, Google's PUE was 1.21, but it dropped to 1.20 for the fourth quarter and to 1.19 for the first quarter of 2009 (through March 15)
- The newest facilities have a PUE of 1.12
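For reference, PUE is total facility power divided by the power delivered to the IT equipment; a quick illustrative calculation (the megawatt figures are invented for the example, only the definition and the quoted ratios come from the text):

```python
# PUE = total facility power / IT equipment power.
it_power_mw    = 10.0   # power reaching servers, storage, and network gear (assumed)
total_power_mw = 11.9   # total facility draw, incl. cooling and distribution losses (assumed)

pue = total_power_mw / it_power_mw
print(f"PUE = {pue:.2f}")   # 1.19, i.e. ~19% overhead on top of the IT load
```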