A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster
K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran

Outline
- Introduction: erasure coding in data centers
  - Low storage, high fault tolerance
  - High download & disk IO during recovery
- Measurements from the Facebook warehouse cluster in production
- Proposed alternative: Piggybacked-RS codes
  - Same storage overhead & fault tolerance
  - 30% reduction in download & disk IO

Need for Redundant Storage
- Frequent unavailability in data centers
  - commodity components fail frequently
  - software glitches, maintenance shutdowns, power failures
- Redundancy gives more reliability and availability

Popular approach: Replication
- Multiple copies of data across machines
- E.g., GFS and HDFS store 3 replicas by default
- Diagram: block 1 = a, block 2 = a, block 3 = b, block 4 = b (a, b: data blocks), typically stored across different racks

Petabyte-scale data: Replication expensive
- Moderately sized data: storage is cheap, so replication is viable
- Multiple tens of PBs of aggregate storage: storage is no longer cheap, so replication is expensive

Erasure Codes
             Replication   Reed-Solomon (RS) code
block 1      a             a       (data block)
block 2      a             b       (data block)
block 3      b             a+b     (parity block)
block 4      b             a+2b    (parity block)
Redundancy   2x            2x
- First-order comparison: replication tolerates any one failure; the RS code tolerates any two failures
- In general: replication gives lower MTTDL with a high storage requirement; RS codes give an order of magnitude higher MTTDL with much less storage

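The first-order comparison above can be checked by brute force. The sketch below is illustrative only: it assumes arithmetic in the prime field GF(257) (the slides do not name a field) and represents each stored block by its coefficients on a and b. The names `REPLICATION`, `RS_2_2` and `max_tolerated_failures` are hypothetical.

```python
from itertools import combinations

P = 257  # prime field used only for illustration (an assumption)

# Each stored block is c_a*a + c_b*b, written as the pair (c_a, c_b).
REPLICATION = [(1, 0), (1, 0), (0, 1), (0, 1)]  # a, a, b, b
RS_2_2 = [(1, 0), (0, 1), (1, 1), (1, 2)]       # a, b, a+b, a+2b


def recoverable(surviving):
    """True if some pair of surviving blocks determines both a and b."""
    for (p, q), (r, s) in combinations(surviving, 2):
        if (p * s - q * r) % P != 0:  # nonzero 2x2 determinant
            return True
    return False


def max_tolerated_failures(layout):
    """Largest t such that every set of t lost blocks is still recoverable."""
    tolerated = 0
    for failures in range(1, len(layout) + 1):
        for lost in combinations(range(len(layout)), failures):
            surviving = [blk for i, blk in enumerate(layout) if i not in lost]
            if not recoverable(surviving):
                return tolerated
        tolerated = failures
    return tolerated


print(max_tolerated_failures(REPLICATION))  # 1
print(max_tolerated_failures(RS_2_2))       # 2
```

Both layouts store four blocks for two blocks of data, so the extra tolerated failure comes at no additional storage cost.
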
Erasure Codes
- Using RS codes instead of 3-replication on less-frequently accessed data has led to savings of multiple petabytes in the Facebook warehouse cluster

Reed-Solomon (RS) Codes
- (#data, #parity) RS code: tolerates failure of any #parity blocks
- These (#data + #parity) blocks constitute a stripe
- The Facebook warehouse cluster uses a (10, 4) RS code
- Example: (2, 2) RS code with data blocks a, b and parity blocks a+b, a+2b
  - #data = 2 (data blocks), #parity = 2 (parity blocks), 4 blocks in a stripe

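As a quick arithmetic aid, the storage overhead of a (#data, #parity) code is (#data + #parity) / #data. The helper below is a hypothetical illustration, not part of any HDFS API; 3-replication is modelled as one data block plus two extra copies.

```python
def storage_overhead(num_data, num_parity):
    """Bytes stored per byte of user data for a (#data, #parity) stripe."""
    return (num_data + num_parity) / num_data


rs_10_4 = storage_overhead(10, 4)        # (10, 4) RS code
rs_2_2 = storage_overhead(2, 2)          # (2, 2) RS code from the example
three_replicas = storage_overhead(1, 2)  # 3-replication as 1 block + 2 copies

print(f"(10,4) RS: {rs_10_4:.1f}x")
print(f"(2,2) RS: {rs_2_2:.1f}x")
print(f"3-replication: {three_replicas:.1f}x")
```

Under these definitions, each petabyte of user data occupies 1.4 PB with the (10, 4) RS code and 3 PB with 3-replication.
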
Why RS codes?
- Maximum possible fault tolerance for the storage overhead
  - storage-capacity optimal
  - maximum-distance-separable (MDS), in coding-theory parlance
- Flexibility in choice of parameters
  - supports any #data and #parity
- However, they result in increased download and disk IO during data recovery

Data Recovery: Increased download & disk IO
- Replication (blocks a, a, b, b): a lost copy of a is recovered by downloading the other copy of a; download & IO 1x
- Reed-Solomon code (blocks a, b, a+b, a+2b): a lost block a is recovered by downloading b and a+b; download & IO 2x
- In general: download & IO required = #data x (size of data to be recovered)

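The general rule above is simple enough to express as a helper. This is only an illustrative sketch with hypothetical function names; the 256 MB figure is the block size quoted in the system description.

```python
def replication_recovery_download(lost_bytes):
    """Replication copies the lost data from one surviving replica."""
    return lost_bytes


def rs_recovery_download(num_data, lost_bytes):
    """RS recovery reads #data blocks, each as large as the lost block."""
    return num_data * lost_bytes


BLOCK_MB = 256  # block size quoted in the system description

print(replication_recovery_download(BLOCK_MB))  # MB read under replication
print(rs_recovery_download(2, BLOCK_MB))        # MB read under (2, 2) RS
print(rs_recovery_download(10, BLOCK_MB))       # MB read under (10, 4) RS
```

For one lost 256 MB block, the (10, 4) code therefore reads and transfers 2560 MB, ten times what replication needs.
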
Data Recovery: Burden on TOR switches
- Diagram: nodes 1 to 4 store a, b, a+b and a+2b, each behind its own TOR switch; the TOR switches connect through an AS/Router
- Recovery burdens the already oversubscribed top-of-rack (TOR) and higher-level switches

Brief System Description
- HDFS cluster with multiple thousands of nodes
- Multiple tens of PBs and growing
- Data immutable until deleted
- Reducing storage requirements is of high importance
- Uses a (10, 4) RS code to reduce storage requirements on less-frequently accessed data
- Multiple PBs of RS-coded data

Brief System Description
- Data blocks: block 1, block 2, ..., block 10, each 256 MB
- Parity blocks: block 11, ..., block 14
- The diagram marks a 1-byte-wide stripe running across the 256 MB blocks

Machine Unavailability Events
- From HDFS Name-Node logs
- Logged when there is no heartbeat for > 15 min
- Blocks marked unavailable, periodic recovery process
- [Figure: number of machine-unavailability events logged, by day]
- Median of 50 machine-unavailability events logged per day

Missing blocks per stripe
# blocks missing in stripe   % of stripes with missing blocks
1                            98.08
2                            1.87
3                            0.036
4                            9 x 10^-6
5                            9 x 10^-9
Dominant scenario: single block recovery

#Blocks Recovered & Cross-rack Transfers
- Median of 180 TB transferred across racks per day for recovery operations
- Around 5 times that under 3-replication

Piggybacking: Toy Example
Step 1: Take a (2, 2) Reed-Solomon code
           1 byte     1 byte
block 1    a1         b1        (data block)
block 2    a2         b2        (data block)
block 3    a1+a2      b1+b2     (parity block)
block 4    a1+2a2     b1+2b2    (parity block)

Piggybacking: Toy Example
(In the (2,2) RS code: recovery download & IO = 4 bytes)
- To recover block 1 (a1, b1), download block 2 (a2, b2) and block 3 (a1+a2, b1+b2)

Piggybacking: Toy Example
Step 2: Add piggybacks to parity nodes
           1 byte     1 byte
block 1    a1         b1
block 2    a2         b2
block 3    a1+a2      b1+b2
block 4    a1+2a2     b1+2b2+a1
No additional storage!

Fault Tolerance (toy example)
Same fault tolerance as the RS code: can tolerate failure of any 2 nodes
1. Recover a1, a2 from the first bytes of the surviving nodes, as in RS
2. Subtract the piggyback a1 from b1+2b2+a1
3. Recover b1, b2 from the second bytes, as in RS

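The three decoding steps can be sketched as follows. This is a minimal illustration, assuming arithmetic in the prime field GF(257) in place of whatever field a real deployment uses (byte-oriented RS implementations commonly use GF(2^8)); `encode`, `solve2` and `decode` are hypothetical names. It needs Python 3.8+ for the modular inverse via `pow`.

```python
from itertools import combinations

P = 257  # illustrative prime field (an assumption, not from the slides)

# Coefficients (c1, c2) of the RS symbol c1*x1 + c2*x2 held by nodes 1..4.
ROWS = [(1, 0), (0, 1), (1, 1), (1, 2)]


def encode(a1, a2, b1, b2):
    """Return the (first byte, second byte) pair stored on each node."""
    first = [(p * a1 + q * a2) % P for p, q in ROWS]
    second = [(p * b1 + q * b2) % P for p, q in ROWS]
    second[3] = (second[3] + a1) % P  # piggyback a1 on node 4's second byte
    return list(zip(first, second))


def solve2(row1, row2, y1, y2):
    """Solve the 2x2 system [row1; row2] * (x1, x2) = (y1, y2) modulo P."""
    (p, q), (r, s) = row1, row2
    det_inv = pow((p * s - q * r) % P, -1, P)
    x1 = (y1 * s - q * y2) * det_inv % P
    x2 = (p * y2 - y1 * r) * det_inv % P
    return x1, x2


def decode(nodes, i, j):
    """Recover (a1, a2, b1, b2) from surviving nodes i and j (0-based)."""
    # Step 1: the first bytes form a plain RS code.
    a1, a2 = solve2(ROWS[i], ROWS[j], nodes[i][0], nodes[j][0])
    # Step 2: strip the piggyback if node 4 is among the survivors.
    second = {k: nodes[k][1] for k in (i, j)}
    if 3 in second:
        second[3] = (second[3] - a1) % P
    # Step 3: the second bytes are now a plain RS code as well.
    b1, b2 = solve2(ROWS[i], ROWS[j], second[i], second[j])
    return a1, a2, b1, b2


data = (17, 200, 5, 99)
nodes = encode(*data)
print(all(decode(nodes, i, j) == data for i, j in combinations(range(4), 2)))
```

Every one of the six possible pairs of surviving nodes returns the original data, which is the "any 2 nodes" claim for this toy code.
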
Recovery (toy example)
Download & IO only 3 bytes (instead of 4 bytes as in RS)
To recover block 1 (a1, b1):
1. Download b2 from block 2, b1+b2 from block 3, and b1+2b2+a1 from block 4
2. Subtract b2 from b1+b2 to get b1
3. Subtract b1+2b2 from b1+2b2+a1 to get a1

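A short sketch of this 3-byte recovery, again assuming the illustrative prime field GF(257); `encode` and `recover_block1` are hypothetical names.

```python
P = 257  # illustrative prime field (an assumption, not from the slides)


def encode(a1, a2, b1, b2):
    """Toy (2, 2) piggybacked code: (first byte, second byte) per block."""
    return [
        (a1, b1),
        (a2, b2),
        ((a1 + a2) % P, (b1 + b2) % P),
        ((a1 + 2 * a2) % P, (b1 + 2 * b2 + a1) % P),
    ]


def recover_block1(blocks):
    """Rebuild block 1 from the second bytes of blocks 2, 3 and 4 only."""
    downloaded = [blocks[1][1], blocks[2][1], blocks[3][1]]
    b2, b1_plus_b2, piggybacked = downloaded
    b1 = (b1_plus_b2 - b2) % P            # subtract b2 from b1+b2
    a1 = (piggybacked - b1 - 2 * b2) % P  # subtract b1+2b2 from b1+2b2+a1
    return (a1, b1), len(downloaded)


blocks = encode(17, 200, 5, 99)
print(recover_block1(blocks))  # ((17, 5), 3)
```

The first bytes of the surviving blocks are never read, which is where the saving over the 4-byte RS recovery comes from.
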
General Piggybacking Recipe
To construct a Piggybacked-RS code:
- Step 1: Take an RS code with identical parameters
- Step 2: Add carefully designed functions from one byte stripe on to another
  - retains same fault tolerance and storage overhead
  - piggyback functions designed to reduce amount of download and IO for recovery
General theory and algorithms: K. V. Rashmi, Nihar Shah, K. Ramchandran, "A Piggybacking Design Framework for Read- and Download-efficient Distributed Storage Codes", in IEEE International Symposium on Information Theory (ISIT), 2013

(10,4) Piggybacked-RS: an alternative to the (10,4) RS code currently used in HDFS

(10,4) Piggybacked-RS code
Step 1: Take a (10, 4) Reed-Solomon code
            1 byte             1 byte
block 1     a1                 b1
...
block 10    a10                b10
block 11    f1(a1,...,a10)     f1(b1,...,b10)
block 12    f2(a1,...,a10)     f2(b1,...,b10)
block 13    f3(a1,...,a10)     f3(b1,...,b10)
block 14    f4(a1,...,a10)     f4(b1,...,b10)

(10,4) Piggybacked-RS code
Step 2: Add piggybacks
            1 byte             1 byte
block 1     a1                 b1
...
block 10    a10                b10
block 11    f1(a1,...,a10)     f1(b1,...,b10)
block 12    f2(a1,...,a10)     f2(b1,...,b10) + f4(a1,a2,a3,0,...,0)
block 13    f3(a1,...,a10)     f3(b1,...,b10) + f4(0,...,0,a4,a5,a6,0,...,0)
block 14    f4(a1,...,a10)     f4(b1,...,b10) + f4(0,...,0,a7,a8,a9,0)

(10,4) Piggybacked-RS code
Tolerates any 4 block failures
1. Recover a1,...,a10 like in RS, from the first bytes of any 10 surviving blocks
2. Subtract the piggybacks (functions of a1,...,a10) from the second bytes of blocks 12, 13 and 14
3. Recover b1,...,b10 like in RS

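The argument rests on the underlying (10, 4) code being MDS: once the first bytes yield a1,...,a10, the piggybacks are known and can be subtracted, leaving a plain RS code on the second bytes. The sketch below checks that MDS property exhaustively for one hypothetical choice of f1,...,f4, namely a Cauchy matrix over the prime field GF(257). Both the field and the coefficients are assumptions made for illustration; the slides do not specify them. It needs Python 3.8+.

```python
from itertools import combinations

P = 257       # illustrative prime field (an assumption)
K, R = 10, 4  # #data, #parity


def inv(x):
    """Multiplicative inverse modulo P."""
    return pow(x, -1, P)


# Hypothetical parity coefficients: Cauchy matrix entry 1 / (x_j - y_i)
# with x_j = K + j and y_i = i, so every denominator lies in 1..13.
PARITY = [[inv(K + j - i) for i in range(K)] for j in range(R)]

# One generator row per block: identity rows for data, PARITY rows for parity.
GEN = [[int(i == d) for i in range(K)] for d in range(K)] + PARITY


def rank_mod_p(rows):
    """Rank of a matrix over GF(P), by Gauss-Jordan elimination."""
    m = [row[:] for row in rows]
    rank = 0
    for col in range(K):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        scale = inv(m[rank][col])
        m[rank] = [v * scale % P for v in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                factor = m[r][col]
                m[r] = [(v - factor * w) % P for v, w in zip(m[r], m[rank])]
        rank += 1
    return rank


def tolerates_any(num_failures):
    """True if every set of num_failures lost blocks leaves rank K."""
    n = K + R
    return all(
        rank_mod_p([GEN[i] for i in range(n) if i not in lost]) == K
        for lost in combinations(range(n), num_failures)
    )


print(tolerates_any(4))  # True
print(tolerates_any(5))  # False
```

Losing five blocks leaves only nine equations for ten unknowns, so four failures is the most any (10, 4) code can tolerate.
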
(10,4) Piggybacked-RS code
Efficient data recovery
Example: recovery of block 1 (a1, b1)
1. Download b2,...,b10 from blocks 2 to 10 and f1(b1,...,b10) from block 11; recover b1,...,b10 like in RS
2. Download f2(b1,...,b10) + f4(a1,a2,a3,0,...,0) from block 12; subtract f2(b1,...,b10)
3. Download a2 and a3 from blocks 2 and 3; remove the effect of a2 and a3 to get a1
Download & IO: 20 bytes in RS, 13 bytes in Piggybacked-RS

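The recovery of block 1 can be traced in code. As a sketch under stated assumptions: the parity functions f1,...,f4 are taken to be rows of a Cauchy matrix over the prime field GF(257), which is a hypothetical choice since the slides do not give the coefficients, and `encode` and `repair_block1` are illustrative names. It needs Python 3.8+.

```python
P = 257       # illustrative prime field (an assumption)
K, R = 10, 4  # #data, #parity


def inv(x):
    """Multiplicative inverse modulo P."""
    return pow(x, -1, P)


# Hypothetical coefficients of f1..f4 (rows of a Cauchy matrix).
F = [[inv(K + j - i) for i in range(K)] for j in range(R)]


def f(j, xs):
    """Parity function f_j (j = 1..4) applied to a length-10 vector."""
    return sum(c * x for c, x in zip(F[j - 1], xs)) % P


def encode(a, b):
    """Return 14 blocks, each a (first symbol, second symbol) pair."""
    blocks = list(zip(a, b))
    # Data indices (0-based) piggybacked onto blocks 11, 12, 13, 14.
    groups = [(), (0, 1, 2), (3, 4, 5), (6, 7, 8)]
    for j, group in enumerate(groups, start=1):
        masked = [a[i] if i in group else 0 for i in range(K)]
        piggyback = f(4, masked) if group else 0
        blocks.append((f(j, a), (f(j, b) + piggyback) % P))
    return blocks


def repair_block1(blocks):
    """Rebuild (a1, b1) and count the symbols downloaded."""
    # Step 1: b2..b10 and f1(b) give b1, like in RS (10 symbols).
    b_rest = [blocks[i][1] for i in range(1, K)]
    f1_b = blocks[K][1]
    known = sum(c * x for c, x in zip(F[0][1:], b_rest))
    b1 = (f1_b - known) * inv(F[0][0]) % P
    # Step 2: block 12's second symbol minus f2(b) is the piggyback (1 symbol).
    piggyback = (blocks[K + 1][1] - f(2, [b1] + b_rest)) % P
    # Step 3: remove the effect of a2 and a3 to get a1 (2 symbols).
    a2, a3 = blocks[1][0], blocks[2][0]
    a1 = (piggyback - F[3][1] * a2 - F[3][2] * a3) * inv(F[3][0]) % P
    downloads = len(b_rest) + 1 + 1 + 2
    return (a1, b1), downloads


a = [3, 141, 59, 26, 53, 58, 97, 93, 23, 84]
b = [62, 64, 33, 83, 27, 95, 2, 88, 41, 97]
print(repair_block1(encode(a, b)))  # ((3, 62), 13)
```

Plain RS would read both symbols of ten surviving blocks, 20 in total; here the first symbols of blocks 4 to 14 are never touched.
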
(10,4) Piggybacked-RS code
Efficient data recovery
- Repair of blocks 1, 2, 3: uses the piggyback f4(a1,a2,a3,0,...,0) stored on block 12
- Repair of blocks 4, 5, 6: uses the piggyback f4(0,...,0,a4,a5,a6,0,...,0) stored on block 13
- Repair of blocks 7, 8, 9: uses the piggyback f4(0,...,0,a7,a8,a9,0) stored on block 14
- Repair of block 10: a10 appears in none of the piggybacks

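The per-block download counts follow from the grouping above. The sketch below tallies them for the ten data blocks, assuming that block 10 falls back to plain RS recovery (20 symbols) because it is in no piggyback group; that fallback and the helper names are assumptions, and parity-block repair is not covered.

```python
K = 10           # #data blocks
SUBSTRIPES = 2   # symbols per block in the construction
RS_COST = K * SUBSTRIPES  # plain RS recovery reads 20 symbols

PIGGYBACK_GROUPS = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]


def repair_download(block):
    """Symbols downloaded to repair one data block (1-based index)."""
    for group in PIGGYBACK_GROUPS:
        if block in group:
            # K symbols rebuild the b substripe, 1 is the piggybacked
            # parity symbol, and the rest are the other group members' a's.
            return K + 1 + (len(group) - 1)
    return RS_COST  # assumed fallback: no piggyback covers this block


costs = {block: repair_download(block) for block in range(1, K + 1)}
average = sum(costs.values()) / K
saving = 1 - average / RS_COST

print(costs)
print(f"average {average:.1f} symbols, saving {saving:.1%}")
```

Under these assumptions the average over data blocks is 13.7 symbols against 20 for RS, in line with the roughly 30% reduction quoted in the outline.
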
Expected Performance
- Storage efficiency and reliability
  - no additional storage vs RS
  - same fault tolerance vs RS
- Reduced recovery download & disk IO
  - 30% less for single block recoveries in a stripe
  - potential reduction of >50 TB of cross-rack traffic per day
- Recovery time: expect faster recovery
  - need to connect to more nodes
  - system limited by disk and network bandwidth
  - corroborated by preliminary experiments
  - hence, expect higher MTTDL

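The ">50 TB" figure can be sanity-checked with a back-of-the-envelope estimate. The sketch assumes the 30% reduction applies to all of the measured median cross-rack recovery traffic, which is a simplification: the slides state it for single block recoveries, the dominant case in the measurements. The function name is hypothetical.

```python
MEDIAN_CROSS_RACK_TB_PER_DAY = 180  # measured median from the cluster
REDUCTION_PERCENT = 30              # stated reduction for single block recovery


def estimated_daily_saving_tb(traffic_tb, reduction_percent):
    """Rough saving if the reduction applied to all recovery traffic."""
    return traffic_tb * reduction_percent / 100


saving_tb = estimated_daily_saving_tb(
    MEDIAN_CROSS_RACK_TB_PER_DAY, REDUCTION_PERCENT
)
print(f"about {saving_tb:.0f} TB/day")
```

This gives about 54 TB per day, consistent with the potential reduction stated above.
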
Related Work: Measurements
- Existing studies
  - Availability studies: Schroeder & Gibson 2007, Jiang et al. 2008, Ford et al. 2010, etc.
  - Comparisons between replication and erasure codes: Rodrigues & Liskov 2005, Weatherspoon & Kubiatowicz 2002, etc.
- Our focus
  - Increased network traffic due to increased downloads during recovery of erasure-coded data
  - Measurements from the Facebook warehouse cluster in production

Related Work: Codes for Efficient Data Recovery
- Huang et al. (Windows Azure) 2012, Sathiamoorthy et al. (Xorbas) 2013: add additional parities; need extra storage
- Hu et al. (NCFS) 2011: network file system using repair-by-transfer codes (Shah et al.); need extra storage
- Khan et al. (Rotated-RS) 2012: #parity <= 3 (also, #data <= 36)
- Xiang et al., Wang et al. (Optimized RDP & EVENODD) 2010: #parity <= 2
- Our solution: Piggybacked-RS
  - no additional storage: storage-capacity optimal
  - any #data & #parity
  - as good as or better than Rotated-RS, optimized RDP & EVENODD

Summary and Future Work
- Erasure codes require higher download & IO for recovery
- Measurements from the Facebook warehouse cluster in production
- Piggybacked-RS: alternative to RS
  - no additional storage required; same fault tolerance as RS
  - 30% reduction in download & disk IO for recovery
- Future work
  - implementation in HDFS (in progress at UC Berkeley)
  - empirical evaluation