Reliability Comparison of Various Regenerating Codes for Cloud Services
  Jung-Hyun Kim, Jin Soo Park, Ki-Hyeon Park, Inseon Kim, Mi-Young Nam, and Hong-Yeop Song
  Yonsei Univ., Seoul, Korea
  ICTC 2013, Oct. 14-16, 2013
Contents
  Introduction
    Why do we need regenerating codes for clouds?
    How to regenerate failed nodes?
  Background of Regenerating Codes
    Regenerating code framework
    Tradeoff between storage size and repair bandwidth
  Various Regenerating Codes
    Minimum Storage Regenerating (MSR) codes
    Minimum Bandwidth Regenerating (MBR) codes
    Local Reconstruction Codes (LRC)
    LT Regenerating codes
  Simulation Results
  Conclusion
Introduction: Why do we need regenerating codes for clouds?
  To repair node failures. At Facebook, it is quite typical to have 20 or more node failures per day.
  [Figure: Number of failed nodes over a single month period in a cluster of 3000 nodes]
  M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing Elephants: Novel Erasure Codes for Big Data," in Proc. of the 39th International Conf. on Very Large Data Bases, 2013.
Introduction: How to regenerate failed nodes?
  Node repair using codes for the erasure channel
  [Figure: a node failure and its repair under a repetition code and under an MDS code whose parity nodes store A+B and A+2B]
Introduction: How to regenerate failed nodes?
  MDS codes have higher reliability than repetition codes.
  [Figure: node failures that cause a repair failure under the repetition code can still be repaired under the MDS code (parity nodes A+B and A+2B)]
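The reliability claim on this slide can be checked by brute force on a four-node toy example. The sketch below assumes the node contents are A, B, A, B for the repetition code and A, B, A+B, A+2B for the MDS code (reconstructed from the figure residue), and it assumes arithmetic modulo the prime 257, since the slides do not name a field. The names REP_ROWS, MDS_ROWS, recoverable, and surviving_patterns are illustrative only.

```python
from itertools import combinations

P = 257  # assumed prime modulus; the slides do not specify the field

# Each node stores a linear combination c_A*A + c_B*B; rows hold (c_A, c_B).
REP_ROWS = [(1, 0), (0, 1), (1, 0), (0, 1)]  # A, B, A, B
MDS_ROWS = [(1, 0), (0, 1), (1, 1), (1, 2)]  # A, B, A+B, A+2B


def recoverable(rows, alive):
    """True if some pair of surviving nodes is linearly independent mod P."""
    for i, j in combinations(alive, 2):
        det = (rows[i][0] * rows[j][1] - rows[i][1] * rows[j][0]) % P
        if det != 0:
            return True
    return False


def surviving_patterns(rows, failures=2):
    """Count failure patterns from which both A and B can still be recovered."""
    n = len(rows)
    ok = 0
    total = 0
    for failed in combinations(range(n), failures):
        alive = [i for i in range(n) if i not in failed]
        total += 1
        ok += int(recoverable(rows, alive))
    return ok, total


if __name__ == "__main__":
    print("repetition:", surviving_patterns(REP_ROWS))
    print("MDS:       ", surviving_patterns(MDS_ROWS))
```

With two simultaneous failures, the repetition code loses data whenever both copies of the same symbol fail, while any two surviving nodes of the MDS code suffice.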
Background of Regenerating Codes: Regenerating Code Framework
  n : number of storage nodes
  k : number of storage nodes for data collection
  α : storage size per node
  M : data size
  d : number of storage nodes for node repair (read cost)
  β : download size
  dβ : repair bandwidth
Background of Regenerating Codes: Reducing storage size and repair bandwidth
  Based on the min-cut bound:
    \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \ge M
  [Figure: Storage-repair bandwidth trade-off curve (M=7000, n=15, k=7, d=7). x-axis: storage size per node α; y-axis: repair bandwidth dβ. The two ends of the curve are the Minimum Storage Regenerating (MSR) point, achieved by MSR codes, and the Minimum Bandwidth Regenerating (MBR) point, achieved by MBR codes.]
  N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Distributed Storage Codes With Repair-by-Transfer and Nonachievability of Interior Points on the Storage-Bandwidth Tradeoff," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1837-1852, Mar. 2012.
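As a worked check of the plotted example, the two end points of the trade-off curve can be computed from the standard closed-form expressions for the MSR and MBR points. These formulas come from the regenerating-codes literature and are not printed on the slide; the function names are illustrative. The sketch also evaluates the min-cut sum to confirm that both points meet the bound with equality for M=7000, k=7, d=7.

```python
from fractions import Fraction


def msr_point(M, k, d):
    """(storage per node alpha, repair bandwidth d*beta) at the MSR point."""
    alpha = Fraction(M, k)
    gamma = Fraction(M * d, k * (d - k + 1))
    return alpha, gamma


def mbr_point(M, k, d):
    """(alpha, d*beta) at the MBR point, where alpha equals d*beta."""
    gamma = Fraction(2 * M * d, k * (2 * d - k + 1))
    return gamma, gamma


def min_cut_capacity(alpha, beta, k, d):
    """Left-hand side of the min-cut bound: sum of min(alpha, (d-i)*beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


if __name__ == "__main__":
    M, k, d = 7000, 7, 7
    for name, (alpha, gamma) in (("MSR", msr_point(M, k, d)),
                                 ("MBR", mbr_point(M, k, d))):
        beta = gamma / d
        print(name, "alpha =", alpha, "d*beta =", gamma,
              "min-cut =", min_cut_capacity(alpha, beta, k, d))
```

For these parameters the MSR point is (α, dβ) = (1000, 7000) and the MBR point is (1750, 1750), which agrees with the axis ranges of the plotted curve.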
Background of Regenerating Codes: Reducing repair read cost
  Repair read cost: the minimum number of nodes that must be read for repair
  Repetition code: repair read cost 1; efficient repair, low reliability
  MDS code (parity nodes A+B and A+2B): repair read cost 2; high reliability, inefficient repair
  Q. Is there a tradeoff between repair read cost and reliability?
Various Regenerating Codes
  Regenerating codes (tradeoff between storage size and repair bandwidth):
    Minimum Storage Regenerating (MSR) codes
    Minimum Bandwidth Regenerating (MBR) codes
  Repair read cost:
    Local Reconstruction Codes (LRC)
  Rateless codes (good performance on the erasure channel):
    LT Regenerating codes
Various Regenerating Codes: Minimum Storage Regenerating (MSR) codes
  Using a Maximum Distance Separable (MDS) code
  [Figure: (8, 4) MDS code]
  Code construction methods: Interference Alignment method, Product-Matrix method, etc.
  N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Interference Alignment in Regenerating Codes for Distributed Storage: Necessity and Code Constructions," IEEE Trans. Inf. Theory, vol. 58, no. 4, pp. 2134-2158, April 2012.
  K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227-5239, Aug. 2011.
Various Regenerating Codes: Minimum Bandwidth Regenerating (MBR) codes
  Using a Fractional Repetition (FR) code
  [Figure: (8, 4) FR code]
  Code construction methods: Repair-by-transfer method, Product-Matrix method, etc.
  K. W. Shum and Y. Hu, "Functional-Repair-by-Transfer Regenerating Codes," in Proc. of 2012 IEEE International Symposium on Information Theory, Cambridge, MA, July 2012.
  K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227-5239, Aug. 2011.
Various Regenerating Codes: Local Reconstruction Codes (LRC)
  Extending an MDS code
  [Figure: (8, 2, 2) LRC]
  C. Huang, M. Chen, and J. Li, "Pyramid codes: flexible schemes to trade space for access efficiency in reliable data storage systems," in Sixth IEEE International Symposium on Network Computing and Applications (NCA 2007), pp. 79-86, 2007.
Various Regenerating Codes: Local Reconstruction Codes (LRC)
  Repair read cost comparison between an MSR code and an LRC
  MSR code: repair read cost = 8
  LRC: repair read cost = 4
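The read cost of 4 can be illustrated with a small sketch of local repair. It assumes that the (8, 2, 2) LRC holds 8 data symbols split into 2 local groups of 4, each protected by one XOR local parity; the global parities are omitted because they are not needed for a single-node repair. This reading of the parameters, and the names K, L, GROUP, local_parities, and repair_data_symbol, are assumptions for illustration rather than the construction used in the slides.

```python
from functools import reduce
from operator import xor

K = 8           # data symbols (assumed reading of the first LRC parameter)
L = 2           # local groups (assumed reading of the second LRC parameter)
GROUP = K // L  # data symbols per local group


def local_parities(data):
    """One XOR parity per local group of GROUP data symbols."""
    return [reduce(xor, data[g * GROUP:(g + 1) * GROUP]) for g in range(L)]


def repair_data_symbol(data, parities, lost):
    """Rebuild data[lost] from its local group only.

    Returns (repaired value, number of nodes read).
    """
    g = lost // GROUP
    helpers = [data[i] for i in range(g * GROUP, (g + 1) * GROUP) if i != lost]
    helpers.append(parities[g])
    return reduce(xor, helpers), len(helpers)


if __name__ == "__main__":
    data = [3, 14, 15, 92, 65, 35, 89, 79]
    parities = local_parities(data)
    for lost in range(K):
        value, reads = repair_data_symbol(data, parities, lost)
        print("lost symbol", lost, "repaired:", value == data[lost], "reads:", reads)
```

Each repair touches the three other data symbols of the group plus the local parity, so 4 nodes are read, whereas repairing through an MDS code over all 8 data symbols would read 8.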
Various Regenerating Codes: LT Regenerating Codes
  Using the ideal/robust soliton distribution as the degree distribution
  [Figure: (8, 4) LT regenerating code]
  M. Asteris and A. G. Dimakis, "Repairable fountain codes," in Proc. of 2012 IEEE International Symposium on Information Theory, Cambridge, MA, July 2012.
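For reference, the ideal soliton distribution mentioned on this slide has a simple closed form: ρ(1) = 1/k and ρ(d) = 1/(d(d-1)) for d = 2, ..., k. The sketch below tabulates it for k = 4 input symbols, assuming k = 4 is the number of source symbols in the (8, 4) example. The function name ideal_soliton is illustrative, and the robust soliton variant is not shown.

```python
from fractions import Fraction


def ideal_soliton(k):
    """Return rho as a list where rho[d] is the probability of degree d.

    Index 0 is unused and left at zero so that indices match degrees.
    """
    rho = [Fraction(0)] * (k + 1)
    rho[1] = Fraction(1, k)
    for d in range(2, k + 1):
        rho[d] = Fraction(1, d * (d - 1))
    return rho


if __name__ == "__main__":
    rho = ideal_soliton(4)
    for d in range(1, 5):
        print("degree", d, "probability", rho[d])
    print("total", sum(rho))
```

The probabilities telescope to 1, and most of the mass sits on low degrees, which keeps the number of symbols combined into each coded symbol small.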
Simulation Results: Better cost and overhead trade-off
  Storage overhead: the ratio of the number of all storage nodes, n, to the number of storage nodes for data collection, k
  Repair read cost: the number of storage nodes read for node repair, d
  [Figure: repair read cost (d) versus storage overhead (n/k) for the LT, MSR, RS, LRC, MBR, and REP codes; arrows labeled "Better" mark the preferred direction on each axis]
Simulation Results: Repair failure probability for different node failure probabilities
  Node failure prob.: the probability that a node is unavailable
  Repair failure prob.: the probability that a newcomer node cannot repair the original data symbol from the coded data symbols of the surviving storage nodes
  [Figure: repair failure probability (log scale, 10^-5 to 10^0) versus node failure probability (0.15 to 0.4). Legend entries as extracted: MBR, REP (7,8), LT (7,8), LRC (7,3,5), MSR, RS (7,8). An arrow labeled "Better" marks the preferred direction.]
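The two definitions above can be related by a deliberately simplified model, which is not the simulation behind the plotted curves. It assumes each of the other n-1 nodes is unavailable independently with probability p and that a repair succeeds whenever any d of them are available. That fits an MDS-style repair but not schemes that need specific helpers, such as a replica or an LRC local group. The function name repair_failure_prob is illustrative.

```python
from math import comb


def repair_failure_prob(n, d, p):
    """P(fewer than d of the other n-1 nodes are available).

    Assumes i.i.d. node unavailability with probability p and that any d
    available helpers are enough for the repair (a simplifying assumption).
    """
    others = n - 1
    return sum(comb(others, j) * (1 - p) ** j * p ** (others - j)
               for j in range(d))


if __name__ == "__main__":
    for p in (0.15, 0.2, 0.25, 0.3, 0.35, 0.4):
        print("p =", p,
              "d=4:", repair_failure_prob(15, 4, p),
              "d=8:", repair_failure_prob(15, 8, p))
```

Under this model a smaller repair read cost d lowers the repair failure probability at a fixed node failure probability, which is the kind of effect the comparison on this slide examines.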
Conclusion
  The trade-off between repair read cost and storage overhead suggests that the best coding scheme depends on system requirements.
  Although LRC is not an MDS code, it achieves both low repair read cost and low storage overhead by relaxing the MDS property.
  Hence LRC is a good candidate for practical systems and deserves further study as a coding scheme for cloud services.