Trevi: Watering down storage hotspots with cool fountain codes
Toby Moncaster, University of Cambridge
Trevi summary
Ø Trevi is a cool new approach to data centre storage
  - based on existing ideas that are known to work well
Ø It leverages fountain coding to give a number of key advantages:
  - Resilience to loss
  - Data replication for free
  - (Reliable) Multicasting of writes
  - Multisourcing of reads
  - Full support for multiple network paths
Ø Trevi works on commodity hardware and needs no changes to the application or network
Ø Trevi doesn't need TCP or any other reliable transport
Background
Ø Commodity data centres are here to stay (Amazon, Google, etc.)
Ø This has led to lots of new ideas and innovations:
  - Novel programming paradigms: MapReduce, Hadoop, Ciel
  - New network topologies: FatTree, VL2, CamCube
  - Alternative transport protocols: MPTCP, DCTCP, HULL
Ø But storage was the poor relation (until recently):
  - GFS/Colossus is a distributed blob store with a central metadata server
  - Flat Datacenter Storage advocates a distributed metadata approach
Ø Both are liable to TCP-related problems such as incast and unnecessary retransmissions
Brief reminder on Fountain Codes
Ø Fountain codes are a form of reliable multicast [1]
Ø They are rateless and loss-tolerant
Ø Based on sparse erasure codes (Luby Transform, Tornado codes)
Ø 1-7% overhead (depends on the statistical distribution used)
Ø Use XOR for encode and decode operations
Worked example (data blocks D1..D5 encoded into symbols C1..C7, "+" is XOR):
  C1 = D1
  C2 = D1 + D2 + D4
  C3 = D2 + D3
  C4 = D3 + D4
  C5 = D4 + D5
  C6 = D1 + D3 + D5
  C7 = D5
Decoding after receiving only C1, C2, C4, C5, C7:
  C1 → D1;  C7 → D5;  C5 + D5 → D4;  C4 + D4 → D3;
  C2 + D1 → C' (= D2 + D4);  C' + D4 → D2
[1] A digital fountain approach to asynchronous reliable multicast. Byers, Luby et al.
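The XOR encode/decode mechanism on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Trevi implementation: it draws symbol degrees uniformly at random, whereas a real LT code would use a (robust) soliton distribution, which is what keeps the overhead in the 1-7% range quoted above.

```python
import os
import random

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blocks, rng):
    """Yield an endless stream of (index_set, symbol) pairs.

    Each symbol is the XOR of a random subset of the data blocks.
    The uniform degree choice is a simplification for illustration."""
    n = len(blocks)
    size = len(blocks[0])
    while True:
        degree = rng.randint(1, n)
        idxs = rng.sample(range(n), degree)
        sym = bytes(size)
        for i in idxs:
            sym = xor(sym, blocks[i])
        yield set(idxs), sym

def decode(n, symbol_stream):
    """Peeling decoder: strip known blocks out of each symbol and
    resolve any symbol whose degree drops to 1, as in the slide's
    worked example (e.g. C5 + D5 -> D4)."""
    recovered = {}
    pending = []  # list of [remaining_index_set, residual_symbol]
    for idxs, sym in symbol_stream:
        pending.append([set(idxs), sym])
        progress = True
        while progress:
            progress = False
            for entry in pending:
                # XOR out every block we have already recovered
                for i in [j for j in entry[0] if j in recovered]:
                    entry[1] = xor(entry[1], recovered[i])
                    entry[0].discard(i)
                # a degree-1 residual reveals a new data block
                if len(entry[0]) == 1:
                    i = entry[0].pop()
                    if i not in recovered:
                        recovered[i] = entry[1]
                        progress = True
            pending = [e for e in pending if e[0]]
        if len(recovered) == n:
            return [recovered[i] for i in range(n)]
```

Because the code is rateless, the receiver simply consumes symbols from the stream until decoding completes; lost symbols never need to be retransmitted, only replaced by fresh ones.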
Strawman Design
Ø Writing data (multicast):
  - Source encodes data chunks as symbols
  - Symbols are multicast to a subset of storage nodes
  - If symbols are lost, keep sending new ones until the data is recovered
  - Once all storage nodes have all the data, stop transmitting
Ø Reading data (multisourcing):
  - Client requests data from a set of storage nodes
  - Each node starts creating symbols and transmits them
  - Nodes randomise the coding so all symbols provide new information to the client
  - Client receives data in parallel from all servers
  - Once the data is received, tell the storage nodes to stop
Strawman Issues
Our strawman has several obvious issues:
Ø Wasted bandwidth
  - If you transmit until all nodes send "stop" you will waste bandwidth
  - All nodes see all traffic during writes (due to multicast)
Ø Lack of fairness
  - If you transmit at line rate you squeeze out other traffic
  - If you transmit at a lower rate you might still trigger congestion
Ø Overload at the controller
  - If you transmit at line rate, slower storage controllers will be congested
  - Storage controllers may fail under excess load (snowball effect)
Detailed design
Ø Based on Flat Datacenter Storage:
  - Physical storage is divided into blobs; data is divided into 8 MB tracts
  - Each blob has a Tract Server which controls the location of data tracts
  - Each node has a copy of the Tract Locator Table (lazily replicated)
  - If node g wants tract i in a TLT of length l: use (hash(g) + i) mod l to get the row number
  - The TLT has multiple columns, where each column = 1 replica
  - Includes mechanisms to deal with stale data, node failure, etc.
Ø Our system adds a multicast address to each TLT entry
Ø Data is sent as encoded symbols, but stored as actual data chunks
Ø Use receiver-driven flow control:
  - The receiver dictates the rate at which symbols are sent, determined by the minimum of:
    - the rate at which each storage server can store data
    - the rate at which a sender can send data
    - the congestion in the network
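The TLT row lookup from this slide can be sketched directly. This is a hedged illustration: the slide gives only (hash(g) + i) mod l, so the choice of SHA-1 as the hash function and the string form of the blob identifier are assumptions, not details from FDS or Trevi.

```python
import hashlib

def tlt_row(blob_guid: str, tract_index: int, tlt_length: int) -> int:
    """Row lookup from the slide: (hash(g) + i) mod l.

    The blob identifier g is hashed, the tract index i added, and the
    result taken modulo the TLT length l. The selected row lists the
    replica servers for that tract (and, in Trevi, a multicast address).
    SHA-1 is an assumption; the slide does not name a hash function."""
    h = int.from_bytes(hashlib.sha1(blob_guid.encode()).digest(), "big")
    return (h + tract_index) % tlt_length
```

One useful property visible in the formula: consecutive tracts of the same blob land on consecutive TLT rows, spreading a blob's tracts across the table rather than hot-spotting one row.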
Potential refinements
Ø Predictive flow control
  - Use a hybrid push-pull model
  - Send enough symbols to ensure reception in the absence of loss
  - Then use a pull approach to fill in any gaps
Ø Priority and scavenging
  - Trevi is ideal for scavenger-style transports
  - In the absence of competing traffic it can transmit at any rate
  - If competing traffic is present, slow down the sending rate
  - Virtual queues can be used to measure this (especially at the final hop)
Ø Optimising for slow writes
  - If you are writing data to multiple nodes, one may be much slower
  - This has a big impact on the overall speed of writes. Two solutions:
    1) Remove the slow node from the multicast group
    2) Ignore the slower node; if it becomes overwhelmed it can unsubscribe
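The hybrid push-pull idea above can be sketched as two phases: push enough symbols that a loss-free receiver can decode, then let the receiver pull extra batches to fill gaps. Everything concrete here is an assumption: the 5% default overhead is picked from the 1-7% range quoted earlier, the batch size is arbitrary, and `request_more` is a hypothetical callback standing in for a pull request to the sender.

```python
import math

def symbols_to_push(k: int, overhead: float = 0.05) -> int:
    """Push phase: proactively send enough symbols that a receiver
    seeing no loss can decode k blocks. The 5% overhead default is an
    assumed value from the 1-7% range fountain codes typically need."""
    return k + math.ceil(k * overhead)

def pull_phase(k, received, request_more, batch=4):
    """Pull phase sketch: the receiver requests additional symbol
    batches until it holds enough to decode. `request_more` is a
    hypothetical callback that fetches `batch` fresh symbols."""
    while len(received) < symbols_to_push(k):
        received.extend(request_more(batch))
    return received
```

The attraction of the pull phase is that it is loss-driven rather than timeout-driven: the receiver asks for more rateless symbols only when decoding stalls, so no specific lost packet ever has to be identified or retransmitted.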
Trevi summary (revisited)
Trevi uses fountain codes to provide scalable data centre storage:
Ø Resilience to loss: no retransmissions, no timeouts
Ø Data replication for free: multicast is built in, so multiple copies of each blob are always stored
Ø (Reliable) Multicasting of writes: makes for simple management of replication groups (just subscribe/unsubscribe)
Ø Multisourcing of reads: each node generates a different set of symbols, so all symbols provide new information
Ø Full support for multiple network paths: with careful planning multicast can make full use of available paths
Ø Trevi works on commodity hardware: although hardware offload might speed up the XOR operations (e.g. NetFPGA)
Ø It needs no changes to the application: built on top of UDP, using a simple shim layer between the app and the network
Questions?
Toby.Moncaster@cl.cam.ac.uk
George.Parisis@cl.cam.ac.uk
Anil.Madhavapeddy@cl.cam.ac.uk
Jon.Crowcroft@cl.cam.ac.uk