CamCube - Rethinking the Data Center Cluster
Paolo Costa costa@imperial.ac.uk


Joint work with Ant Rowstron, Austin Donnelly, and Greg O'Shea (MSR Cambridge), and with Hussam Abu-Libdeh and Simon Schubert (interns).


Data centers are large groups of networked servers, typically built from commodity hardware: scale-out rather than scale-up, i.e., many low-cost PCs instead of a few expensive high-end servers, exploiting economies of scale. They run a custom software stack (Microsoft: Autopilot, Dryad; Google: GFS, BigTable, MapReduce; Yahoo: Hadoop, HDFS; Amazon: Dynamo, Astrolabe; Facebook: Cassandra, Scribe, HBase), which is the focus of this talk.

There is a mismatch between abstraction and reality. Programmers see the network through key-based overlay networks, with an explicit structure and symmetry of roles.


A typical data center network: roughly 12 racks per pod, 20-40 servers per rack, and 10 Gbps links in the switch hierarchy.

Data center services are distributed, but they have no control over the network: they must infer network properties (e.g., locality), they must fall back on inefficient overlay networks, and they get no bandwidth guarantees.

Today's workarounds come from both directions. Top-down (applications): custom routing? Build overlays. Priority traffic? Use multiple TCP flows. Locality? Use traceroute and ping. TCP incast? Add random jitter. Bottom-up (network): more bandwidth? Use richer topologies (e.g., fat-trees). Fate-sharing? Use prediction models to estimate flow lengths and traffic patterns. TCP incast? Use ECN in routers.

The network is still a black box. That is good for a multi-owned, heterogeneous Internet, but limiting for a single-owned, homogeneous data center. Data centers are not mini-Internets: there is a single owner and administration domain, we know (and define) the topology, hardware and software diversity is low, and there are no malicious or selfish nodes. Is TCP/IP still appropriate?

CamCube looks at the data center from the perspective of a distributed-systems builder and asks: what happens if you integrate network and compute? It sits at the intersection of distributed systems, HPC, and networking.

Architecture, bottom up: the topology, a link- and key-based API, and, built on top of that API, TCP/IP and service composition.

There is no distinction between the overlay and the underlying network. Each node has an (x, y, z) coordinate, and these coordinates define the key space. A simple one-hop API is used to send and receive packets; all other functionality is implemented as services. Key coordinates are re-mapped in case of failures, enabling a KBR-like (key-based routing) API.
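To make the coordinate and key abstraction concrete, below is a minimal sketch, in C# (the language of the prototype), of what such a one-hop, key-based node API could look like. The type and member names (Coord, ICamCubeNode, SendToKey, and so on) are illustrative assumptions, not the actual CamCube interface.

using System;

// Illustrative sketch only: a CamCube-style node exposes its own (x, y, z)
// coordinate, a one-hop send to a direct neighbour, and key-based delivery;
// everything else (routing, multicast, caching, ...) is built as services.
public struct Coord
{
    public readonly int X, Y, Z;
    public Coord(int x, int y, int z) { X = x; Y = y; Z = z; }

    // Hop distance on a k-ary 3D torus (each axis wraps around).
    public int TorusDistance(Coord other, int k)
    {
        int d = 0;
        foreach (var (a, b) in new[] { (X, other.X), (Y, other.Y), (Z, other.Z) })
        {
            int diff = Math.Abs(a - b);
            d += Math.Min(diff, k - diff);
        }
        return d;
    }
}

public interface ICamCubeNode
{
    Coord LocalCoord { get; }
    void SendToNeighbour(Coord neighbour, byte[] packet);  // one physical hop
    void SendToKey(Coord key, byte[] packet);              // KBR-like delivery
    event Action<Coord, byte[]> OnPacketReceived;          // (sourceKey, payload)
}

In a sketch like this, re-mapping a failed server's coordinates only changes where SendToKey delivers, which is the property that gives services a stable, KBR-like view of the system.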

Example: a "Fetch mail costa:1234" request traverses servers (0,0,0), (1,0,0), and (2,0,0). Packets move from service to service and from server to server, and services can intercept and process packets along the way. Keys are re-mapped in case of failures, and server resources (RAM, disks, CPUs) are accessible to all services.

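As an illustration of how a service might intercept and process packets as they pass through each server, here is a hedged sketch (reusing the hypothetical Coord type from above); the ICamCubeService interface and the mail-cache example are inventions for explanation, not part of the actual runtime.

// Illustrative per-hop hook: the runtime hands each in-transit packet to the
// services registered on the local server, which may consume, rewrite, or
// simply let it continue towards its destination key.
public interface ICamCubeService
{
    // Return true if the packet was consumed here; otherwise the runtime
    // forwards it one hop closer to destinationKey.
    bool OnForward(Coord destinationKey, ref byte[] payload);
}

// Toy example: answer "Fetch mail" requests from a local in-memory cache
// instead of forwarding them all the way to the home server of the key.
public sealed class MailCacheService : ICamCubeService
{
    private readonly System.Collections.Generic.Dictionary<string, byte[]> cache =
        new System.Collections.Generic.Dictionary<string, byte[]>();

    public bool OnForward(Coord destinationKey, ref byte[] payload)
    {
        string request = System.Text.Encoding.UTF8.GetString(payload);
        if (cache.TryGetValue(request, out var reply))
        {
            payload = reply;   // rewrite the packet with the cached response
            return true;       // consumed locally, no further hops needed
        }
        return false;          // not cached: keep forwarding
    }
}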

Prototype: 27 servers in a 3x3x3 cube, each a quad-core Intel Xeon 5520 at 2.27 GHz with 12 GB RAM and 6 Intel PRO/1000 PT 1 Gbps ports. The runtime and the services (multicast, key-value store, MapReduce, a graph processing engine, and others) are implemented in C#. In one experiment, all servers saturate all 6 links at 11.94 Gbps (9 KB packets) while using less than 22% of the CPU.
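As a rough sanity check, and only under the assumption that the 11.94 Gbps figure is per server and counts traffic in both directions, the number sits just below the hardware limit of the six 1 Gbps ports:

\[ 6 \;\text{ports} \times 1\,\text{Gbps} \times 2 \;\text{(send + receive)} = 12\,\text{Gbps} \approx 11.94\,\text{Gbps measured}. \]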

MapReduce was originally developed by Google; its open-source version (Hadoop) is used by several companies (Yahoo, Facebook, Twitter, Amazon) for many applications (data mining, search, indexing). The abstraction is simple: map applies a function that generates (key, value) pairs, and reduce combines the intermediate results.
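To make the map/reduce abstraction concrete, here is a minimal word-count sketch in C#; it follows the generic MapReduce model and is not tied to Hadoop's or Camdoop's actual APIs.

using System.Collections.Generic;
using System.Linq;

public static class WordCount
{
    // map: for each input record, emit intermediate (key, value) pairs.
    public static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        foreach (var word in line.Split(new[] { ' ' },
                                        System.StringSplitOptions.RemoveEmptyEntries))
            yield return new KeyValuePair<string, int>(word, 1);
    }

    // reduce: combine all intermediate values that share the same key.
    public static int Reduce(string key, IEnumerable<int> values) => values.Sum();

    // Toy single-process "job": run map, group by key, then run reduce.
    public static Dictionary<string, int> Run(IEnumerable<string> lines) =>
        lines.SelectMany(Map)
             .GroupBy(kv => kv.Key, kv => kv.Value)
             .ToDictionary(g => g.Key, g => Reduce(g.Key, g));
}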


With a single reduce task, the results are stored on a single node and any reduce function can be handled, but that node becomes the bottleneck (both network and CPU).

With multiple reduce tasks, the CPU and network load is distributed, but the shuffle generates N^2 flows and therefore needs full bisection bandwidth, and the results end up split across R servers, which is not always applicable (e.g., max).
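Spelling out the flow count (with M and R introduced here only for the argument): if each of M map tasks sends one partition of its intermediate data to each of R reduce tasks, the shuffle creates M x R flows; with a mapper and a reducer on roughly every one of the N servers, that is on the order of N^2 flows, which is why full bisection bandwidth is required.

\[ \text{shuffle flows} = M \times R \;\approx\; N^2 \quad \text{when } M \approx R \approx N. \]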


The alternative: use hierarchical aggregation!


Intermediate results can be aggregated at each hop.


In-network aggregation: traffic is reduced at each hop, there is no need for full bisection bandwidth, and the final result is stored on a single node.
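A minimal sketch of per-hop aggregation for an associative and commutative reduce function (integer sums keyed by string, as in word count); it illustrates the idea of combining partial results before forwarding them, not Camdoop's actual implementation.

using System.Collections.Generic;

// Illustrative hop-level combiner: a server on the aggregation tree merges the
// (key, value) pairs received from its children with its own local pairs and
// forwards a single, smaller stream towards the root.
public static class HopAggregator
{
    public static Dictionary<string, int> Combine(
        IEnumerable<IDictionary<string, int>> childResults,
        IDictionary<string, int> localResults)
    {
        var merged = new Dictionary<string, int>(localResults);
        foreach (var child in childResults)
            foreach (var kv in child)
                merged[kv.Key] = merged.TryGetValue(kv.Key, out var sum)
                    ? sum + kv.Value   // associative + commutative: safe to combine early
                    : kv.Value;
        return merged;
    }
}

Because the combine step can only shrink the data, each hop forwards at most as much as it receives, which is exactly the property that removes the need for full bisection bandwidth.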

Camdoop builds 6 disjoint trees rather than a single tree: this makes it failure resilient and provides (up to) 6 Gbps of throughput per server.

Evaluation on the 27-server testbed with 27 GB of input (1 GB per server). The parameter S is the ratio of output size to intermediate size: S = 1/N (approximately 0) corresponds to full aggregation, and S = 1 to no aggregation. Two configurations are used: all-to-one and all-to-all.
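A brief note on why full aggregation corresponds to S = 1/N here: if a key appears in the intermediate data of all N = 27 servers, perfect aggregation collapses those N occurrences into a single output record, so

\[ S = \frac{\text{output size}}{\text{intermediate size}} = \frac{1}{N} = \frac{1}{27} \approx 0.04 \approx 0. \]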

[Results plots: time in seconds (log scale, lower is better) versus the output-size to intermediate-size ratio S, from S = 0 (full aggregation) to S = 1 (no aggregation), for four series: switch, Camdoop (no agg), Camdoop, and Switch 1 Gbps (bound). Annotations mark the impact of aggregation, the impact of higher bandwidth, that performance without aggregation is independent of S, and the aggregation ratio reported by Facebook.]

In-network aggregation is beneficial also for low-latency services. [Plot: time in milliseconds (lower is better) for input sizes of 20 KB and 200 KB per server, comparing switch, Camdoop (no agg), and Camdoop.]

Conclusions: building large-scale services is hard because of the mismatch between the physical and the logical topology and because the network is a black box. CamCube is designed from the perspective of developers, with an explicit topology and a key-based abstraction. The benefits are simplified development, more efficient applications, and better fault tolerance.

H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic Routing for Future Data Centres. In ACM SIGCOMM, 2010.

P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. Camdoop: Exploiting In-network Aggregation for Big Data Applications. In USENIX NSDI, 2012.

Paolo Costa http://www.doc.ic.ac.uk/~costa