Symbiosis in Scale Out Networking and Data Management. Amin Vahdat Google/UC San Diego vahdat@google.com



Overview. Large-scale data processing needs scale out networking: unlocking the potential of modern server hardware for at-scale problems requires orders-of-magnitude improvement in network performance. WARNING: networking is about to reinvent many aspects of centrally managed, replicated state, with a variety of consistency requirements, in a distributed environment. Scale out networking requires large-scale data management: experience with Google's SDN WAN suggests that logically centralized state management is critical for cost-effective deployment and management. We are still in the stone ages in dynamically managing state and getting updates to the right places in the network.

Vignette 1: Large-Scale Data Processing Needs Scale Out Networking

Motivation Blueprints for 200k sq. ft. Data Center in OR

San Antonio Data Center

Chicago Data Center

Dublin Data Center

All Filled with Commodity Computation and Storage

Network Design Goals. Scalable interconnection bandwidth: full bisection bandwidth between all pairs of hosts; aggregate bandwidth = # hosts × host NIC capacity. Economies of scale: price per port constant with the number of hosts; must leverage commodity merchant silicon. Anything anywhere: don't let the network limit the benefits of virtualization. Management: modular design; avoid actively managing 100s-1000s of network elements.

Scale Out Networking. Advances toward scale out computing and storage: aggregate computing and storage grow linearly with the number of commodity processors and disks, and it is a small matter of software to enable functionality. The alternative is scale up, where weaker processors and smaller disks are replaced with more powerful parts. Today there is no technology for scale out networking: modules to expand the number of ports or aggregate bandwidth, with no management of individual switches, VLANs, or subnets.

The Future Internet. Applications and data will be partitioned and replicated across multiple data centers; 99% of compute, storage, and communication will be inside the data center, and data center bandwidth will exceed that of the access network. Data sizes will continue to explode, from click streams to scientific data to user audio, photo, and video collections. Individual user requests and queries will run in parallel on thousands of machines, and back-end analytics and data processing will dominate.

Emerging Rack Architecture (figure: cores and caches sharing an L3 cache; DDR3-1600 DRAM (10s of GB) at ~100 Gb/s and ~5 ns; PCIe 3.0 x16 at ~128 Gb/s and ~250 ns to a 2x10GigE NIC; TBs of storage; a 24-port 40 GigE switch with ~150 ns latency). Can we leverage emerging merchant switch silicon and newly proposed optical transceivers and switches to treat the entire data center as a single logical computer?

Amdahl's (Lesser Known) Law: balanced systems for parallel computing. For every 1 MHz of processing power, a balanced system needs 1 MB of memory and 1 Mbit/sec of I/O (late 1960s). Fast forward to 2012: 4 × 2.5 GHz processors, 8 cores, gives 30-60 GHz of processing power (not that simple!) and 24-64 GB of memory, but only 1 Gb/sec of network bandwidth?? Deliver 40 Gb/s of bandwidth to 100k servers? That is 4 Pb/sec of bandwidth required today.
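
A quick back-of-the-envelope check of the numbers above (illustrative arithmetic only; the figures are the ones quoted on the slide, not measurements):

```python
# Back-of-the-envelope check of the balance argument above.
# All figures are taken from the slide; this is illustrative arithmetic only.

servers = 100_000      # target data center size
nic_gbps = 40          # per-server bandwidth we would like to deliver

# Aggregate bandwidth needed if every host can drive its NIC at line rate:
aggregate_pbps = servers * nic_gbps / 1e6          # Gb/s -> Pb/s
print(f"aggregate bandwidth: {aggregate_pbps:.0f} Pb/s")   # 4 Pb/s

# Amdahl's balanced-system rule of thumb: ~1 Mbit/s of I/O per MHz of compute,
# i.e. ~1 Gb/s per GHz. A 2012 server with 30-60 GHz of aggregate compute
# would therefore want 30-60 Gb/s of I/O, not the ~1 Gb/s NIC it typically had.
for compute_ghz in (30, 60):
    print(f"{compute_ghz} GHz of compute -> {compute_ghz} Gb/s of balanced I/O")
```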

Sort as an Instance of Balanced Systems. Hypothesis: significant efficiency is lost in systems that bottleneck on one resource. Sort as the example: the GraySort 2009 record was 100 TB in 173 minutes on 3452 servers, or ~22.3 Mb/s/server. An out-of-core sort requires 2 reads and 2 writes per record. What would it take to sort at 3.2 Gb/s/server? 4 × 100 MB/sec/node with 16 × 500 GB disks per server: 100 TB in 83 minutes on 50 servers?
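
The same kind of arithmetic recovers the per-server rates quoted on this slide; a minimal sketch, assuming ~100 MB/s of sequential bandwidth per disk as stated above:

```python
# Reproducing the per-server sort rates quoted above (illustrative arithmetic).

TB = 1e12  # bytes

# 2009 GraySort record: 100 TB in 173 minutes on 3452 servers.
rate_2009 = 100 * TB / (173 * 60) / 3452           # bytes/s per server
print(f"2009 record: {rate_2009 * 8 / 1e6:.1f} Mb/s per server")   # ~22.3 Mb/s

# Balanced target: 16 disks * ~100 MB/s each, but an out-of-core sort touches
# every byte 4 times (2 reads + 2 writes), so effective sort throughput is
# raw disk bandwidth / 4.
disk_bw = 16 * 100e6                                # bytes/s of raw disk bandwidth
sort_rate = disk_bw / 4                             # ~400 MB/s of sorted data per server
print(f"disk-limited rate: {sort_rate * 8 / 1e9:.1f} Gb/s per server")  # ~3.2 Gb/s

# Time to sort 100 TB on 50 such servers:
minutes = 100 * TB / (sort_rate * 50) / 60
print(f"100 TB on 50 servers: ~{minutes:.0f} minutes")              # ~83 minutes
```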

TritonSort Phase 1, Map and Shuffle (pipeline figure): Reader → NodeDistributor → LogicalDiskDistributor → Sender → (network) → Receiver → Writer, with producer, sender, receiver, and writer buffer pools, reading from 8 input disks and writing to 8 intermediate disks per server.

TritonSort Phase 2, Reduce (pipeline figure): Reader → Sorter → Writer, with a phase-2 buffer pool, reading from 8 intermediate disks and writing to 8 output disks.

Reverse Engineering the Pipeline. Goal: minimize the number of logical disks. Phase 2: read, sort, write (repeat), with one sorter per core; this needs 24 buffers (3 per core), and with ~20 GB/server that gives ~830 MB per logical disk, so 2 TB / 830 MB per logical disk → ~2400 logical disks. The long pole in phase 1 is the LogicalDiskDistributor buffering sufficient data for streaming writes: ~18 GB / 2400 logical disks = 7.5 MB per buffer, a ~15% seek penalty.
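
For concreteness, the logical-disk sizing above works out as follows (a sketch of the slide's arithmetic, not TritonSort code):

```python
# The logical-disk sizing arithmetic behind the slide above (illustrative only).

GB, MB, TB = 1e9, 1e6, 1e12

buffers_per_core, cores = 3, 8
phase2_buffers = buffers_per_core * cores            # 24 in-memory sort buffers
usable_ram = 20 * GB                                 # per-server memory for phase 2
buffer_size = usable_ram / phase2_buffers            # ~833 MB, the ~830 MB on the slide
print(f"logical disk size: ~{buffer_size / MB:.0f} MB")

data_per_server = 2 * TB                             # data per server
logical_disks = data_per_server / buffer_size        # ~2400 logical disks
print(f"logical disks per server: ~{logical_disks:.0f}")

# Phase 1 bottleneck: the LogicalDiskDistributor must buffer enough per logical
# disk to keep the intermediate-disk writes streaming.
phase1_ram = 18 * GB
write_buffer = phase1_ram / logical_disks            # ~7.5 MB per logical disk
print(f"write buffer per logical disk: ~{write_buffer / MB:.1f} MB")
```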

Balanced Systems Really Do Matter. Balancing network and I/O results in huge efficiency improvements. How much is a factor of 100 improvement worth in terms of cost? (TritonSort: A Balanced Large-scale Sorting System, Rasmussen et al., NSDI 2011.)

System              | Duration | Aggr. Rate | Servers | Rate/server
Yahoo (100 TB)      | 173 min  | 9.6 GB/s   | 3452    | 2.8 MB/s
TritonSort (100 TB) | 107 min  | 15.6 GB/s  | 52      | 300 MB/s

TritonSort Results (http://www.sortbenchmark.org). Hardware: HP DL-380 2U servers with 8 × 2.5 GHz cores, 24 GB RAM, 16 × 500 GB disks, and 2 × 10 Gb/s Myricom NICs, connected by a 52-port Cisco Nexus 5020 switch. Results 2010: GraySort, 100 TB in 123 min on 48 nodes (2.3 Gb/s/server); MinuteSort, 1014 GB in 59 sec on 52 nodes (2.6 Gb/s/server). Results 2011: GraySort, 100 TB in 107 min on 52 nodes (2.4 Gb/s/server); MinuteSort, 1353 GB in 1 min on 52 nodes (3.5 Gb/s/server); JouleSort, 9700 records/joule.

Generalizing TritonSort: Themis-MR. TritonSort's setting is very constrained: 100-byte records and an even key distribution. Can we generalize with the same performance? MapReduce is the natural choice: map → sort → reduce. Challenges: skew (partition, compute, record size, ...); memory management is now hard; moving from task-level to job-level fault tolerance for performance; a long tail of small- to medium-sized jobs on <= 1 PB of data.

Current Status. Themis-MR outperforms Hadoop 1.0 by ~8x on a 28-node, 14 TB GraySort: 30 minutes vs. 4 hours. Implementations of CloudBurst, PageRank, and Word Count are being evaluated. An alpha version won the 2011 Daytona GraySort, beating the previous record holder by 26% with roughly 1/70th the nodes.

Driver: Nonblocking Multistage Datacenter Topologies. M. Al-Fares, A. Loukissas, A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM '08. (Figure: fat tree with k=4, n=3.)

Scalability Using Identical Network Elements (figure: fat tree built from 4-port switches, with a core layer above Pods 0-3).

Scalability Using Identical Network Elements. The 4-port fat tree supports 16 hosts organized into 4 pods; each pod is a 2-ary 2-tree, with full bandwidth among pod-connected hosts.

Scalability Using Identical Network Elements. Full bisection bandwidth at each level of the fat tree makes it rearrangeably nonblocking; the entire fat tree is a 2-ary 3-tree.

Scalability Using Identical Network Elements. 5k^2/4 k-port switches support k^3/4 hosts; with 48-port switches, that is 27,648 hosts using 2,880 switches. Critically, the approach scales to 10 GigE at the edge.
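
These counts follow directly from the fat-tree construction; a small sketch (not the paper's code) that reproduces them:

```python
# Fat-tree sizing from the slide above: a fat tree of identical k-port switches
# uses 5k^2/4 switches and supports k^3/4 hosts at full bisection bandwidth.

def fat_tree(k: int) -> tuple[int, int]:
    """Return (switches, hosts) for a 3-level fat tree of k-port switches."""
    switches = 5 * k * k // 4      # (k/2)^2 core switches + k pods of k switches each
    hosts = k ** 3 // 4            # k pods * (k/2 edge switches) * (k/2 hosts each)
    return switches, hosts

for k in (4, 48):
    s, h = fat_tree(k)
    print(f"k={k:2d}: {s:5d} switches, {h:6d} hosts")
# k= 4:    20 switches,     16 hosts
# k=48:  2880 switches,  27648 hosts
```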

Scalability Using Identical Network Elements. The regular structure simplifies the design of network protocols. Opportunities: performance, cost, energy, fault tolerance, incremental scalability, etc.

Problem: 10 Tons of Cabling. 55,296 Cat-6 cables in 1,128 separate cable bundles ("the yellow wall"). If optics are used for transport, transceivers are ~80% of the cost of the interconnect.

Our Work. Switch architecture [SIGCOMM '08]; cabling and merchant silicon [Hot Interconnects '09]; virtualization, Layer 2, and management [SIGCOMM '09, SOCC '11a]; routing/forwarding [NSDI '10]; hybrid optical/electrical switching [SIGCOMM '10, SOCC '11b]; applications [NSDI '11, FAST '12]; low-latency communication [NSDI '12, ongoing]; transport layer [EuroSys '12, ongoing]; wireless augmentation [SIGCOMM '12].

Vignette 2: Software Defined Networking Needs Data Management

Network Protocols Past and Future. Historically, the goal of network protocols has been to eliminate centralization: every network element should act autonomously, using local information to effect global targets for fault tolerance, performance, policy, and security. The Internet probably would not have happened without such decentralized control. Recent trends point toward Software Defined Networking: a deeper understanding of how to build scalable, fault-tolerant, logically centralized services; the majority of network elements and bandwidth in data centers under the control of a single entity; and requirements for virtualization and global policy.

Software Defined Networking (SDN). (Figure: a VM manager, routing service, and external protocols sit above an OpenFlow Controller, which manages an OpenFlow Agent (OFA) on each switch.) Separate the control plane from the data plane. The Open Networking Foundation and the OpenFlow protocol are leading the charge to enable SDN. What should the OFC → OFA API be?
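
To make the control/data-plane split concrete, here is a minimal, hypothetical sketch of a logically centralized controller pushing forwarding state to per-switch agents; the names (Controller, SwitchAgent, FlowEntry, install, push_route) are illustrative and are not the OpenFlow protocol or Google's OFC/OFA interfaces:

```python
# Minimal, hypothetical sketch of the control/data-plane split described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowEntry:
    match_prefix: str      # e.g. "10.1.2.0/24"
    out_port: int

class SwitchAgent:
    """Runs on the switch (the 'OFA' role): owns only the local hardware table."""
    def __init__(self, name: str):
        self.name, self.table = name, []

    def install(self, entry: FlowEntry) -> None:
        self.table.append(entry)            # in reality: program the TCAM/ASIC

class Controller:
    """Logically centralized control plane (the 'OFC' role)."""
    def __init__(self, agents: list[SwitchAgent]):
        self.agents = agents

    def push_route(self, prefix: str, port_by_switch: dict[str, int]) -> None:
        # Compute state centrally, then distribute it to every data-plane element.
        for agent in self.agents:
            agent.install(FlowEntry(prefix, port_by_switch[agent.name]))

agents = [SwitchAgent("s1"), SwitchAgent("s2")]
Controller(agents).push_route("10.1.2.0/24", {"s1": 3, "s2": 7})
```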

SDN Challenges. Control plane replication, fault tolerance, and scale, with no fate sharing between the control and data planes. Configuration management: when a new router comes online or the topology changes, how do we push information to the right places? Network management: adaptively drill down to retrieve the appropriate network state. Virtualization and multiple control planes. These are all the challenges of large-scale distributed databases; the state of the art is CSV files and Perl scripts.

Google's Software Defined WAN Architecture

ATLAS 2010 Traffic Report ("Google Sets New Internet Traffic Record," Craig Labovitz, October 25, 2010): This month, Google broke an equally impressive Internet traffic record, gaining more than 1% of all Internet traffic share since January. If Google were an ISP, as of this month it would rank as the second largest carrier on the planet. Only one global tier-1 provider still carries more traffic than Google (and this ISP also provides a large portion of Google's transit).

Cloud Computing Requires Massive Wide-Area Bandwidth. Low-latency access from a global audience and the highest levels of availability: o the vast majority of data is migrating to the cloud; o data must be replicated at multiple sites. WAN unit costs are decreasing rapidly, o but not quickly enough to keep up with the even faster increase in WAN bandwidth demand.

WAN Cost Components. Hardware: o routers, o transport gear, o fiber. Overprovisioning: o shortest-path routing, o slow convergence time, o maintaining SLAs despite failures, o no traffic differentiation. Operational expenses/human costs: o box-centric versus fabric-centric views.

Why a Software Defined WAN. Separate hardware from software: o choose hardware based on necessary features, o choose software based on protocol requirements. Logically centralized network control. Automation: separate monitoring, management, and operation from individual boxes. Flexibility and innovation. Result: a WAN that is more efficient, higher performance, more fault tolerant, and cheaper.

A Warehouse-Scale-Computer (WSC) Network (figure: multiple Google data centers connected through Google edge sites, which peer with carrier/ISP edges).

Google's WAN: two backbones. o I-Scale: Internet facing (user traffic). o G-Scale: data center traffic (internal). Widely varying requirements: loss sensitivity, topology, availability, etc. Widely varying traffic characteristics: smooth/diurnal vs. bursty/bulk.

Google's Software Defined WAN

G-Scale Network Hardware. Built from merchant silicon: o 100s of ports of nonblocking 10GbE. OpenFlow support. Open-source routing stacks for BGP and IS-IS. Does not have all features: o no support for AppleTalk... Multiple chassis per site: o fault tolerance, o scale to multiple Tb/s.

G-Scale WAN Deployment (figure: data center networks connected across the WAN). Multiple switch chassis in each domain: o custom hardware running Linux, with a Quagga BGP stack and IS-IS/iBGP for internal connectivity.

Mixed SDN Deployment (figure sequence; not representative of the actual topology). A data center network connects through cluster border routers, speaking EBGP locally and IBGP/IS-IS to remote sites. First, a Paxos-replicated OpenFlow Controller (OFC) with Quagga and glue logic is introduced at the site. Then OpenFlow Agents (OFAs) take over the site's switches, so EBGP and IBGP/IS-IS to remote sites run through the SDN site, which delivers full interoperability with legacy sites. Finally, the site is ready to introduce new functionality, e.g., traffic engineering via an RCS and TE Server alongside the OFC.

Bandwidth Broker and Traffic Engineering

High Level Architecture (figure): in the control plane, a Bandwidth Broker and TE Server handle TE and bandwidth allocation, talking through an SDN API and SDN Gateway to the sites for collection and enforcement; in the data plane, traffic sources feed the SDN WAN (N sites).

Bandwidth Broker Architecture (figure): admin policies and a network model feed a Global Broker, which sends global demand to the TE Server and usage limits down to per-data-center Site Brokers (optional).
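
A hedged sketch of the hierarchy described above: per-site brokers aggregate local demands, and a global broker applies usage limits from admin policy and produces the global demand handed to the TE server. All class and field names (SiteBroker, GlobalBroker, usage_limit_gbps) are assumptions for illustration, not Google's implementation:

```python
# Hypothetical sketch of hierarchical bandwidth brokering (names are illustrative).
from collections import defaultdict

class SiteBroker:
    """Aggregates demands from traffic sources within one data center site."""
    def __init__(self, site: str):
        self.site, self.requests = site, []          # list of (dst_site, gbps)

    def request(self, dst_site: str, gbps: float) -> None:
        self.requests.append((dst_site, gbps))

    def aggregate(self) -> dict:
        demand = defaultdict(float)
        for dst, gbps in self.requests:
            demand[(self.site, dst)] += gbps
        return demand

class GlobalBroker:
    """Merges site demands, applies admin usage limits, feeds the TE server."""
    def __init__(self, usage_limit_gbps: float):
        self.limit = usage_limit_gbps

    def global_demand(self, site_brokers) -> dict:
        demand = defaultdict(float)
        for broker in site_brokers:
            for pair, gbps in broker.aggregate().items():
                demand[pair] += gbps
        # Enforce per-pair usage limits taken from admin policy.
        return {pair: min(gbps, self.limit) for pair, gbps in demand.items()}

a, b = SiteBroker("site-a"), SiteBroker("site-b")
a.request("site-b", 80.0); a.request("site-b", 50.0); b.request("site-a", 20.0)
print(GlobalBroker(usage_limit_gbps=100.0).global_demand([a, b]))
# {('site-a', 'site-b'): 100.0, ('site-b', 'site-a'): 20.0}
```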

TE Server Architecture (figure): the Global Broker supplies a demand matrix {src, dst → utility curve} to the Flow Manager; the Topology Manager supplies site-level edges with RTT and capacity, plus interface up/down status; the Path Allocation Algorithm and Path Selection produce an abstract path assignment {src, dst → paths and weights}; the TE Server then issues per-site path manipulation commands through the Gateway to each site's OFC (OFC S1 ... OFC Sn) and its devices.
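
A sketch of the inputs and outputs named on this slide: a demand matrix keyed by site pair, site-level paths with capacities, and a weighted path assignment per pair. The proportional split below is only an illustration; it is not the actual path allocation algorithm:

```python
# Hypothetical sketch of the TE server's inputs/outputs (not Google's algorithm).

# Demand matrix from the global broker: (src, dst) -> requested Gb/s.
demand = {("site-a", "site-c"): 120.0}

# Site-level topology from the topology manager: candidate paths with capacity (Gb/s).
candidate_paths = {
    ("site-a", "site-c"): [
        (("site-a", "site-c"), 100.0),              # direct path
        (("site-a", "site-b", "site-c"), 100.0),    # one-hop detour
    ],
}

def assign_paths(demand, candidate_paths):
    """Split each pair's demand across its paths in proportion to capacity."""
    assignment = {}
    for pair, gbps in demand.items():
        paths = candidate_paths[pair]
        total_cap = sum(cap for _, cap in paths)
        assignment[pair] = [
            (hops, round(gbps * cap / total_cap, 1)) for hops, cap in paths
        ]
    return assignment

# The result is what gets turned into per-site tunnel ops via the gateway/OFCs.
print(assign_paths(demand, candidate_paths))
# {('site-a', 'site-c'): [(('site-a', 'site-c'), 60.0),
#                         (('site-a', 'site-b', 'site-c'), 60.0)]}
```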

Controller Architecture (figure, switches in DC 1): the TE Server / SDN Gateway exchanges topology/routes and TE ops with the OFC, which hosts Routing (Quagga) and a Tunneling App; the OFC pushes flows to the OFAs, which program the hardware tables.

Controller Architecture (figure: the TE Server and SDN Gateway spanning Sites 1-3, alongside the non-TE (IS-IS) path).

Sample Utilization

Benefits of Aggregation

Convergence under Failures (figure: on a failure notification from B, traffic moves from the old A-B-C tunnel to a new tunnel as the TE Server issues new ops). Without TE, the traffic drop lasts ~9 seconds; with TE, ~1 second. Without TE, failure detection and convergence are slower: the delay "inside" TE is much less than the timers for detecting and communicating failures (in IS-IS). Fast failover may take milliseconds, but is not guaranteed to be either accurate or "good".

G-Scale WAN History (timeline): SDN rollout; exit testing on the "opt in" network; SDN fully deployed; central TE deployed.

Range of Failure Scenarios (figure: redundant TE servers TE1*/TE2, gateways GW1/GW2, per-site OpenFlow controllers OFC1*/OFC2, and routers, with potential failure conditions marked; * indicates mastership).

Trust but Verify: Consistency Checks

TE View | OFC View | Is Valid | Comment
Clean   | Clean    | yes      | Normal operation.
Clean   | Dirty    | no       | OFC remains dirty forever.
Clean   | Missing  | no       | OFC will forever miss the entry.
Dirty   | Dirty    | yes      | Both think the op failed.
Dirty   | Clean    | yes      | Op succeeded, but the response has not yet been received by TE.
Dirty   | Missing  | yes      | Op issued but not yet received by OFC.
Missing | Clean    | no       | OFC has an extra entry, and will remain like that.
Missing | Dirty    | no       | (Same as above.)
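
The table encodes directly into a reconciliation check. This sketch mirrors it entry for entry; the state names come from the slide, while the function and dictionary are illustrative:

```python
# Direct encoding of the TE-view / OFC-view consistency table above.
# "clean" = entry committed, "dirty" = op in flight, "missing" = no entry.

VALID = {
    ("clean",   "clean"):   True,   # normal operation
    ("clean",   "dirty"):   False,  # OFC would remain dirty forever
    ("clean",   "missing"): False,  # OFC would forever miss the entry
    ("dirty",   "dirty"):   True,   # both sides think the op failed
    ("dirty",   "clean"):   True,   # op succeeded, response not yet seen by TE
    ("dirty",   "missing"): True,   # op issued but not yet received by OFC
    ("missing", "clean"):   False,  # OFC has an extra entry it would keep forever
    ("missing", "dirty"):   False,  # same as above
}

def is_consistent(te_view: str, ofc_view: str) -> bool:
    return VALID[(te_view, ofc_view)]

assert is_consistent("dirty", "clean")        # transient, will converge
assert not is_consistent("clean", "missing")  # needs repair
```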

Implications for ISPs. Dramatically reduce the cost of WAN deployment: o cheaper per bps in both CapEx and OpEx; o less overprovisioning for the same SLAs. A differentiator for end customers: o less cost for the same bandwidth, or more bandwidth for the same cost. Possible to deploy incrementally in a pre-existing network. o Deployment experience with Google's global SDN production WAN suggests SDN is real and it works. o But it's just the beginning.

Conclusions. Large-scale data processing needs scale out networking: unlocking the potential of modern server hardware for at-scale problems requires orders-of-magnitude improvement in network performance. Scale out networking requires large-scale data management: experience with Google's SDN WAN suggests that logically centralized state management is critical for cost-effective deployment and management. We are still in the stone ages in dynamically managing state and getting updates to the right places in the network.

Thank you!