Trevi: Watering down storage hotspots with cool fountain codes. Toby Moncaster, University of Cambridge

Trevi summary

- Trevi is a cool new approach to data centre storage, based on existing ideas that are known to work well
- It leverages fountain coding to give a number of key advantages:
  - Resilience to loss
  - Data replication for free
  - (Reliable) multicasting of writes
  - Multisourcing of reads
  - Full support for multiple network paths
- Trevi works on commodity hardware and needs no changes to the application or network
- Trevi doesn't need TCP or any other reliable transport

Background

- Commodity data centres are here to stay (Amazon, Google, etc.)
- This has led to lots of new ideas and innovations:
  - Novel programming paradigms: MapReduce, Hadoop, Ciel
  - New network topologies: FatTree, VL2, CamCube
  - Alternative transport protocols: MPTCP, DCTCP, HULL
- But storage was the poor relation (until recently...):
  - GFS/Colossus is a distributed blob store with a central metadata server
  - Flat Datacenter Storage advocates a distributed metadata approach
- Both are liable to TCP-related problems such as incast and unnecessary re-transmissions

Brief reminder on fountain codes

- Fountain codes are a form of reliable multicast [1]
- They are rateless and loss-tolerant
- Based on sparse erasure codes (Luby Transform, Tornado codes)
- 1-7% overhead (depends on the statistical distribution used)
- Use XOR for encode and decode operations

Worked example (from the slide's figure): five data chunks D1-D5 are encoded into seven symbols, where "+" denotes XOR:
  C1 = D1, C2 = D1 + D2 + D4, C3 = D2 + D3, C4 = D3 + D4, C5 = D4 + D5, C6 = D1 + D3 + D5, C7 = D5
All seven symbols are transmitted, but only C1, C2, C4, C5 and C7 are received. Decoding proceeds by peeling:
  C1 gives D1; C7 gives D5; C5 + D5 gives D4; C4 + D4 gives D3; C2 + D1 gives C' (= D2 + D4); C' + D4 gives D2

[1] A digital fountain approach to asynchronous reliable multicast. Byers, Luby et al.
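To make the encode/decode steps concrete, below is a minimal Python sketch of an LT-style fountain code over XOR. It is an illustration under stated assumptions, not Trevi's implementation: the uniform degree choice and the names encode_symbol/decode are invented here, and a real LT code draws degrees from a (robust) Soliton distribution, which is where the 1-7% overhead figure comes from.

import random

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_symbol(chunks, rng):
    """XOR a random subset of data chunks into one coded symbol.
    (A real LT code draws the degree from a Soliton distribution.)"""
    degree = rng.randint(1, len(chunks))
    indices = set(rng.sample(range(len(chunks)), degree))
    sym = bytes(len(chunks[0]))
    for i in indices:
        sym = xor(sym, chunks[i])
    return indices, sym

def decode(symbols, n_chunks):
    """Peeling decoder: find a symbol covering exactly one unknown chunk,
    recover that chunk, and repeat until nothing more can be peeled."""
    recovered = {}
    pending = [(set(idx), sym) for idx, sym in symbols]
    progress = True
    while progress and len(recovered) < n_chunks:
        progress = False
        for idx, sym in pending:
            unknown = idx - recovered.keys()
            if len(unknown) == 1:
                for j in idx & recovered.keys():
                    sym = xor(sym, recovered[j])
                recovered[unknown.pop()] = sym
                progress = True
    return [recovered.get(i) for i in range(n_chunks)]

if __name__ == "__main__":
    rng = random.Random(0)
    data = [bytes([i]) * 4 for i in range(5)]        # D1..D5, four bytes each
    symbols = []
    while True:                                      # the sender is "rateless":
        symbols.append(encode_symbol(data, rng))     # it keeps minting symbols
        out = decode(symbols, len(data))             # until the receiver decodes
        if all(chunk is not None for chunk in out):
            break
    assert out == data
    print(f"decoded {len(data)} chunks from {len(symbols)} symbols")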

Strawman design

Writing data (multicast):
- The source encodes data chunks as symbols
- Symbols are multicast to a subset of storage nodes
- If symbols are lost, keep sending new ones until the data is recovered
- Once all storage nodes have all the data, stop transmitting

Reading data (multisourcing):
- The client requests data from a set of storage nodes
- Each node starts creating symbols and transmits them
- Nodes randomise the coding so all symbols provide new information to the client
- The client receives data in parallel from all servers
- Once the data is received, it tells the storage nodes to stop (a toy sketch of this read path follows)
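The sketch below models only the structure of the multisourced read: pull symbols from every replica in parallel and let the client decide when to stop. The ToyNode class, the read_blob helper and the 5% overhead figure are assumptions for illustration; it counts symbols rather than decoding them, relying on the fountain-code property that every randomised symbol is useful.

import itertools
import math

class ToyNode:
    """Stand-in for a storage server that hands out opaque coded symbols;
    in Trevi each node randomises its coding so every symbol is useful."""
    def __init__(self, name):
        self.name, self.sent, self.stopped = name, 0, False
    def next_symbol(self):
        if self.stopped:
            return None
        self.sent += 1
        return (self.name, self.sent)
    def stop(self):
        self.stopped = True

def read_blob(n_chunks, nodes, overhead=0.05):
    """Draw symbols round-robin from all nodes until (statistically) enough
    have arrived to decode, then tell every node to stop sending."""
    needed = math.ceil(n_chunks * (1 + overhead))    # slide quotes 1-7% overhead
    received = []
    for node in itertools.cycle(nodes):
        sym = node.next_symbol()
        if sym is not None:
            received.append(sym)
        if len(received) >= needed:
            break
    for node in nodes:
        node.stop()                                  # client-driven termination
    return received

if __name__ == "__main__":
    nodes = [ToyNode("A"), ToyNode("B"), ToyNode("C")]
    symbols = read_blob(n_chunks=100, nodes=nodes)
    print(len(symbols), "symbols:", {n.name: n.sent for n in nodes})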

Strawman issues

Our strawman has several obvious issues:

- Wasted bandwidth
  - If you transmit until all nodes send "stop", you will waste bandwidth
  - All nodes see all traffic during writes (due to multicast)
- Lack of fairness
  - If you transmit at line rate, you squeeze out other traffic
  - If you transmit at a lower rate, you might still trigger congestion
- Overload at the controller
  - If you transmit at line rate, slower storage controllers will be congested
  - Storage controllers may fail under excess load (snowball effect)

Detailed design

- Based on Flat Datacenter Storage (FDS):
  - Physical storage is divided into blobs; each blob's data is divided into 8 MB tracts
  - Each blob has a Tract Server which controls the location of its data tracts
  - Each node has a copy of the Tract Locator Table (TLT), which is lazily replicated
  - To locate tract i of blob g in a TLT of length l, use (hash(g) + i) mod l to get the line number (see the sketch below)
  - The TLT has multiple columns, where each column = one replica
  - FDS includes mechanisms to deal with stale data, node failure, etc.
- Our system adds a multicast address to each TLT entry
- Data is sent as encoded symbols, but stored as actual data chunks
- We use receiver-driven flow control:
  - The receiver dictates the rate at which symbols are sent
  - The rate is determined by the minimum of:
    - the rate at which each storage server can store data
    - the rate at which a sender can send data
    - the congestion in the network
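A minimal sketch of the FDS-style lookup and the receiver's rate calculation described above. The slide only specifies the (hash(g) + i) mod l lookup, the replica columns and the extra multicast address per entry; the field names, the use of SHA-1 and the symbol_rate helper are assumptions for illustration.

import hashlib
from dataclasses import dataclass
from typing import List

@dataclass
class TLTEntry:
    replicas: List[str]        # one storage server per replica column
    multicast_addr: str        # Trevi's addition: a multicast group per entry

def locate_tract(blob_guid: str, tract_no: int, tlt: List[TLTEntry]) -> TLTEntry:
    """Row for tract i of blob g: (hash(g) + i) mod l, where l = len(TLT)."""
    h = int.from_bytes(hashlib.sha1(blob_guid.encode()).digest()[:8], "big")
    return tlt[(h + tract_no) % len(tlt)]

def symbol_rate(store_rate: float, send_rate: float, network_rate: float) -> float:
    """Receiver-driven flow control: request symbols no faster than the
    slowest of the three constraints listed on the slide."""
    return min(store_rate, send_rate, network_rate)

if __name__ == "__main__":
    tlt = [TLTEntry([f"srv{row}-{col}" for col in range(3)], f"239.1.0.{row}")
           for row in range(8)]
    entry = locate_tract("blob-42", 7, tlt)
    print(entry.replicas, entry.multicast_addr, symbol_rate(400.0, 950.0, 600.0))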

Potential refinements

- Predictive flow control
  - Use a hybrid push-pull model
  - Send enough symbols to ensure reception in the absence of loss
  - Then use a pull approach to fill in any gaps (sketched below)
- Priority and scavenging
  - Trevi is ideal for scavenger-style transports
  - In the absence of competing traffic it can transmit at any rate
  - If competing traffic is present, slow down the sending rate
  - Virtual queues can be used to measure this (especially at the final hop)
- Optimising for slow writes
  - If you are writing data to multiple nodes, one may be much slower
  - This has a big impact on the overall speed of writes. Two solutions:
    1) Remove the slow node from the multicast group
    2) Ignore the slower node; if it becomes overwhelmed it can unsubscribe
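A rough sketch of the hybrid push-pull refinement under stated assumptions: the ToyReceiver class, the loss model and the 5% push overhead are all invented for illustration, and the sketch counts symbols rather than decoding them (any received fountain symbol is assumed to be useful).

import math
import random

class ToyReceiver:
    """Stand-in for one storage server in the multicast group; each delivered
    symbol is independently lost with probability `loss`."""
    def __init__(self, n_chunks, loss, rng):
        self.needed, self.loss, self.rng = n_chunks, loss, rng
    def deliver(self):
        if self.rng.random() > self.loss and self.needed > 0:
            self.needed -= 1

def hybrid_write(receivers, n_chunks, overhead=0.05):
    """Push roughly enough symbols for loss-free reception, then let each
    receiver pull only its remaining shortfall."""
    pushed = math.ceil(n_chunks * (1 + overhead))   # slide: 1-7% coding overhead
    for _ in range(pushed):
        for r in receivers:                         # one multicast send reaches all
            r.deliver()
    pulled = 0
    for r in receivers:
        while r.needed > 0:                         # pull phase fills the gaps with
            r.deliver()                             # freshly minted symbols
            pulled += 1
    return pushed, pulled

if __name__ == "__main__":
    rng = random.Random(1)
    group = [ToyReceiver(1000, loss=0.02, rng=rng) for _ in range(3)]
    print("pushed, pulled:", hybrid_write(group, n_chunks=1000))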

Trevi summary (revisited)

Trevi uses fountain codes to provide scalable data centre storage:

- Resilience to loss: no retransmissions, no timeouts
- Data replication for free: multicast is built in, so multiple copies of each blob are always stored
- (Reliable) multicasting of writes: makes for simple management of replication groups (just subscribe/unsubscribe)
- Multisourcing of reads: each node generates a different set of symbols, so all symbols provide new information
- Full support for multiple network paths: with careful planning, multicast can make full use of available paths
- Trevi works on commodity hardware, although hardware offload might speed up the XOR operations (e.g. NetFPGA)
- It needs no changes to the application: it is built on top of UDP and uses a simple shim layer between the app and the network

Questions?

Toby.Moncaster@cl.cam.ac.uk
George.Parisis@cl.cam.ac.uk
Anil.Madhavapeddy@cl.cam.ac.uk
Jon.Crowcroft@cl.cam.ac.uk