Functional-Repair-by-Transfer Regenerating Codes

Kenneth W. Shum and Yuchong Hu

Abstract: In a distributed storage system, a data file is distributed to several storage nodes such that the original file can be decoded from any subset of the storage nodes of size larger than or equal to a certain threshold. Upon the failure of a storage node, we would like to regenerate it with a minimal amount of data transmission from the surviving nodes to the new node. This performance metric is called the repair-bandwidth. Another performance metric is the disk input/output (I/O) cost, which measures the number of bits a storage node needs to read out from its memory in order to repair the failed node. In this paper, we give examples of linear regenerating codes with minimal disk I/O cost and repair-bandwidth, without any linear mixing in the helping storage nodes.

Index Terms: Cloud storage, distributed storage system, regenerating codes, network coding.

I. INTRODUCTION

Regenerating codes, as introduced by Dimakis et al. in [2], are a class of erasure codes for distributed storage systems. A source data file is encoded across the storage nodes such that a data collector can decode the source data file by connecting to a fraction of the storage nodes. Should a storage node fail, the failed node can be repaired by downloading some data packets from the surviving nodes. The aim of the design of regenerating codes is to minimize the total traffic required in the repair process.

Besides the bandwidth requirement, another metric that arises in practice is the disk I/O cost: we want to minimize the number of bits that a surviving node must read out from its memory during the repair of the failed node. In the extreme case where the disk I/O cost is minimal, the number of bits read out from the memory is exactly equal to the number of bits to be sent out. Data combining is only required at the receiving end. A regenerating code with minimal disk I/O cost is called a repair-by-transfer regenerating code. Some constructions of
repair-by-transfer regenerating codes attaining minimum repair-bandwidth can be found in [3], [4].

In this paper, we consider minimum-storage regenerating codes. This class of regenerating codes finds applications in multiple-cloud storage [5]. We give an example of a repair-by-transfer regenerating code which has the same parameters as the example in [2, Fig. 2], but in which linear combining is not required in the surviving storage nodes. The encoding of packets is shown in Fig. 1. The first packet in each storage node is an uncoded packet; the second one is the sum of two packets, with addition performed in $F_2$, the finite field of size 2.

            1st packet    2nd packet
  Node 1    A             C + D
  Node 2    B             D + A
  Node 3    C             A + B
  Node 4    D             B + C

Fig. 1. An example of exact regenerating codes with repair-by-transfer.

(Footnote: Part of this work was presented in the Information Theory and Applications Workshop, San Diego, Feb. 2012 [1]. The work of K. W. Shum and Y. Hu was partially supported by a grant from the University Grants Committee (Project No. AoE/E-02/08) of the Hong Kong Special Administrative Region, China. K. W. Shum and Y. Hu are with the Institute of Network Coding, The Chinese University of Hong Kong, Shatin, Hong Kong. Email: {wkshum, ychu}@inc.cuhk.edu.hk.)

Let $x \oplus y$ denote the unique integer $z \in \{1, 2, 3, 4\}$ such that $x + y \equiv z \pmod 4$. The parity-check packet in node $i \oplus 2$ is the sum of the uncoded packets in node $i$ and node $i \oplus 1$. One can verify that the four information packets can be decoded from any two nodes. For example, from nodes 1 and 3, packets A and C can be obtained directly, and packet B (resp. D) can be decoded by adding packet A (resp. C) to A + B (resp. C + D).

If node 1 fails, we can repair it by downloading packet D + A from node 2, packet C from node 3, and packet D from node 4. Packet A can be recovered by adding packets D + A and D, and packet C + D can be recovered by adding packets C and D. By the cyclic structure of the code, nodes 2, 3 and 4 can be repaired similarly. We also note that data update can be performed efficiently; should an
information packet be updated, we only need to modify one uncoded packet and two parity-check packets in the system.

The above example falls under the category of exact repair. In this paper, we consider functional repair. Regenerating codes for functional repair in general are described in Section II, and the repair-by-transfer subclass is discussed in Section III. A family of repair-by-transfer regenerating codes for functional repair is detailed in Section IV.

II. NETWORK CODES FOR DISTRIBUTED STORAGE

Let $n$ be the number of storage nodes. A data file of size $M$ is encoded and distributed to the $n$ storage nodes. The amount of data stored in each node is denoted by $\alpha$. We divide time into stages. Initially, the system starts at stage 0. After a node failure in stage $t$, the failed node is repaired and the time is advanced to the $(t+1)$-st stage. The design objectives of a regenerating code for a distributed storage network (DSN) are:

(Node repair) Upon the failure of a storage node in stage $t$, we recover it by setting up a newcomer, who contacts any $d$ surviving nodes, called the helpers, and downloads $\beta$ units of data from each of them. The total amount of data transmitted from the helpers is called the repair-bandwidth and is denoted by $\gamma = d\beta$.

(File retrieval) In any stage, a data collector can decode the original data file by downloading from any $k$ storage nodes. We call this property the $(n, k)$ recovery property. If $\alpha$ is equal to $M/k$, which is the minimum possible value for $\alpha$, then the system is said to be maximal-distance separable (MDS).

Any code which realizes the $(n, k)$ recovery property and repairs a failed node by connecting to $d$ surviving nodes is called an $(n, d, k)$ regenerating code, or simply a regenerating code if the parameters are understood from the context.

There are two modes of repair: functional repair and exact repair. In functional repair, the content of the newcomer is not necessarily the same as the content of the failed storage node; only the $(n, k)$ recovery property is preserved. In exact repair, the content of the newcomer must be exactly the same as in the failed node.

For a given $\alpha$, there is a fundamental limit on the repair cost measured in terms of the repair-bandwidth. For functional repair, a lower bound on the repair-bandwidth can be derived via the max-flow bound for single-source multicasting [2]. In the MDS case, the lower bound on the repair-bandwidth is given by

$$\gamma \ge \frac{dM}{k(d + 1 - k)}. \qquad (1)$$

It is proved in [6] that for functional repair, the lower bound in (1) can be achieved by a linear network code over a fixed finite field.

In the following, we describe the realization of regenerating codes by linear network codes. Let $q$ be a prime power and let the finite field of size $q$ be denoted by $F_q$. The parameters $B$, $\alpha$ and $\beta$ of the DSN are all integers. We will also refer to an element of $F_q$ as a packet. The file is divided into many stripes of data, each consisting of $B$ packets, and each stripe is encoded in the same way. In the remainder of this paper, we focus on one stripe of data and suppose that the data file consists of $B$ packets $m_1, m_2, \ldots, m_B$. These $B$ symbols are called the message symbols. Each of the $n$ storage nodes stores $\alpha$ packets obtained by taking some linear combinations of the
message symbols. The vector formed by the $B$ coefficients of the linear combination associated with a packet is called the global encoding vector of the packet. In the $t$-th stage, we denote the global encoding vector of the $j$-th packet stored in the $i$-th node by $\Gamma_t(i, j)$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, \alpha$. Thus the $j$-th packet in the $i$-th node is equal to the dot product of $\Gamma_t(i, j)$ and $[m_1\ m_2\ \cdots\ m_B]$. We call $(i, j)$ the index of the vector $\Gamma_t(i, j)$, and view $\Gamma_t$ as a function mapping an index $(i, j)$ to a global encoding vector in stage $t$. The definition of $\Gamma_t$ is extended to sets of indices by defining

$$\Gamma_t(\mathcal{X}) := \{\Gamma_t(x) : x \in \mathcal{X}\}$$

for any set of indices $\mathcal{X}$.

Let $G^{(t)}_i$ be the $\alpha \times B$ global encoding matrix whose rows are precisely the global encoding vectors $\Gamma_t(i, 1), \ldots, \Gamma_t(i, \alpha)$, and let $m$ be the column vector $[m_1\ m_2\ \cdots\ m_B]^T$. (We use the superscript $(t)$ to indicate that a variable is associated with stage $t$, and the superscript $T$ for the transpose operator.) The packets stored in the $i$-th node in the $t$-th stage are the components of $G^{(t)}_i m$. The DSN is initialized such that, at stage 0, the global encoding vectors in any $k$ storage nodes form a full-rank matrix.

Suppose that node $i$ fails in stage $t$ and the helpers are nodes $h_1, h_2, \ldots, h_d$. For $j = 1, 2, \ldots, d$, helper $h_j$ sends $\beta$ packets to the newcomer. Let $L^{(t)}_{h_j}$ be the $\beta \times \alpha$ local encoding matrix of helper $h_j$ at stage $t$. The packets sent from helper $h_j$ to the newcomer are the components of $L^{(t)}_{h_j} G^{(t)}_{h_j} m$. The newcomer takes some linear combinations of the received packets and computes $\alpha$ packets by

$$H^{(t)}_i \begin{bmatrix} L^{(t)}_{h_1} G^{(t)}_{h_1} \\ \vdots \\ L^{(t)}_{h_d} G^{(t)}_{h_d} \end{bmatrix} m,$$

where $H^{(t)}_i$ is an $\alpha \times (d\beta)$ matrix over $F_q$. After the repair process in stage $t$, the global encoding matrices are updated by

$$G^{(t+1)}_j = \begin{cases} G^{(t)}_j & \text{if } j \ne i, \\ H^{(t)}_i \begin{bmatrix} L^{(t)}_{h_1} G^{(t)}_{h_1} \\ \vdots \\ L^{(t)}_{h_d} G^{(t)}_{h_d} \end{bmatrix} & \text{if } j = i. \end{cases}$$

If we want to repair the DSN exactly, the encoding matrix $G^{(t+1)}_j$ should equal $G^{(t)}_j$ for all $j$. For functional repair, we only want to maintain the $(n, k)$ recovery property, namely, at any stage and in any set of $k$ storage nodes, the rank of the $k\alpha$ global encoding vectors is equal to $B$.

III. REPAIR-BY-TRANSFER FOR FUNCTIONAL REPAIR

In a
repair-by-transfer regenerating code, we choose the local encoding matrix $L^{(t)}_{h_j}$ such that there is exactly one 1 in each row, while the other entries are all zero. In the remainder of this paper, we focus on $(n, n-1, 2)$ repair-by-transfer regenerating codes for functional repair, with parameters

$$d = n - 1, \quad k = 2, \quad B = k(d + 1 - k) = 2(n - 2), \quad \alpha = B/k = n - 2, \quad \beta = 1,$$

i.e., any two storage nodes are sufficient for rebuilding the original file, and each node stores $B/2$ packets (the MDS case). Any regenerating code with the above parameters is optimal with respect to the bound in (1).

For a repair-by-transfer regenerating code with the above parameters, the local encoding matrix $L^{(t)}_j$ is a zero-one row vector. If node $\ell_t$ fails in stage $t$, then for $j \ne \ell_t$ we let $\chi(t, \ell_t, j)$ be the index of the unique 1 in $L^{(t)}_j$, i.e., node $j$ sends packet $\chi(t, \ell_t, j)$ to node $\ell_t$. The function $\chi$ is called the choice function and takes values between 1 and $\alpha$. Let

$$R_t(\ell) := \{(j, \chi(t, \ell, j)) : j \in \{1, 2, \ldots, n\} \setminus \{\ell\}\}$$

be the set of indices of the global encoding vectors used in the repair of node $\ell$. For example, in Fig. 1, if node 1 fails, we can repair node 1 using packets D + A, C and D from nodes 2, 3 and 4, respectively. The choice function is $\chi(t, 1, 2) = 2$, $\chi(t, 1, 3) = 1$ and $\chi(t, 1, 4) = 1$, and we have $R_t(1) = \{(2, 2), (3, 1), (4, 1)\}$.

In case node $\ell_t$ fails in stage $t$, the repair-by-transfer process consists of the following steps:

(i) For $j \ne \ell_t$, node $j$ sends the $\chi(t, \ell_t, j)$-th packet in its memory to the newcomer.

(ii) Put the packets received by the newcomer together as a column vector and multiply it by an $\alpha \times (n - 1)$ local encoding matrix $H^{(t)}$. Update the global encoding matrix of the newly reconstructed node to

$$G^{(t+1)}_{\ell_t} = H^{(t)} \begin{bmatrix} \Gamma_t(1, \chi(t, \ell_t, 1)) \\ \vdots \\ \Gamma_t(\ell_t - 1, \chi(t, \ell_t, \ell_t - 1)) \\ \Gamma_t(\ell_t + 1, \chi(t, \ell_t, \ell_t + 1)) \\ \vdots \\ \Gamma_t(n, \chi(t, \ell_t, n)) \end{bmatrix}. \qquad (2)$$

(iii) Determine the choice function for the repair in the next stage, $\chi(t + 1, i, j)$, for $i \ne \ell_t$ and $j \in \{1, 2, \ldots, n\} \setminus \{i\}$.

We introduce some notation. Let

$$C_i := \{(i, 1), (i, 2), \ldots, (i, \alpha)\} \quad \text{and} \quad C_{ij} := C_i \cup C_j.$$

The $(n, 2)$ recovery property requires that, for any $t$, the global encoding vectors indexed by $C_{ij}$ with $i \ne j$ in stage $t$ are linearly independent.

Example 1. Consider a $(5, 4, 2)$ regenerating code. In stage $t$, the global encoding vectors can be tabulated as

  Node 1    $\Gamma_t(1,1)$    $\Gamma_t(1,2)$    $\Gamma_t(1,3)$
  Node 2    $\Gamma_t(2,1)$    $\Gamma_t(2,2)$    $\Gamma_t(2,3)$
  Node 3    $\Gamma_t(3,1)$    $\Gamma_t(3,2)$    $\Gamma_t(3,3)$
  Node 4    $\Gamma_t(4,1)$    $\Gamma_t(4,2)$    $\Gamma_t(4,3)$
  Node 5    $\Gamma_t(5,1)$    $\Gamma_t(5,2)$    $\Gamma_t(5,3)$

Suppose that node 1 fails in stage $t$ and we want to repair node 1 by the first packets $\Gamma_t(2,1)$, $\Gamma_t(3,1)$, $\Gamma_t(4,1)$ and $\Gamma_t(5,1)$ in nodes 2 to 5, respectively. In order to maintain the $(n, 2)$ recovery property in stage $t + 1$, we need to guarantee that, after the repair, each of the four sets of global encoding vectors $\Gamma_{t+1}(C_{12})$, $\Gamma_{t+1}(C_{13})$, $\Gamma_{t+1}(C_{14})$ and $\Gamma_{t+1}(C_{15})$ is linearly independent.

For an integer $j$ between 1 and $n$, let

$$S_t(\ell, j) := C_j \cup R_t(\ell).$$

The set $S_t(\ell, j)$
is interpreted as the index set of the packets of node $j$ together with the packets used in repairing node $\ell$ in stage $t$. Continuing Example 1, we have

$$S_t(1, 2) = \{(2,1), (2,2), (2,3), (3,1), (4,1), (5,1)\},$$
$$S_t(1, 3) = \{(3,1), (3,2), (3,3), (2,1), (4,1), (5,1)\},$$
$$S_t(1, 4) = \{(4,1), (4,2), (4,3), (2,1), (3,1), (5,1)\},$$
$$S_t(1, 5) = \{(5,1), (5,2), (5,3), (2,1), (3,1), (4,1)\}.$$

In the array of Example 1, each of these four sets consists of the full row of node $j$ together with the first-column entries of the other surviving nodes.

We note that for $j \ne \ell$, the set $S_t(\ell, j)$ consists of $2(n - 2)$ elements, because $|C_j| = n - 2$, $|R_t(\ell)| = n - 1$ and $|C_j \cap R_t(\ell)| = 1$. When $\ell = j$, $|S_t(\ell, j)| = 2n - 3$.

Suppose that the $(n, 2)$ recovery property holds in stage $t$. In order to maintain the $(n, 2)$ recovery property after the repair of node $\ell_t$, we need to check that each of the sets of global encoding vectors $\Gamma_{t+1}(C_{\ell_t j})$, for $j \in \{1, 2, \ldots, n\} \setminus \{\ell_t\}$, is linearly independent. (We do not need to check $\Gamma_{t+1}(C_{jj'})$ for $j, j' \ne \ell_t$, because $\Gamma_t(C_{jj'}) = \Gamma_{t+1}(C_{jj'})$.) We make the following observations for $j \ne \ell_t$:

(i) Each vector in $\Gamma_{t+1}(C_{\ell_t j})$ can be obtained as a linear combination of the vectors in $\Gamma_t(S_t(\ell_t, j))$. Therefore, if the vectors in $\Gamma_t(S_t(\ell_t, j))$ are linearly dependent, then the vectors in $\Gamma_{t+1}(C_{\ell_t j})$ are also linearly dependent.

(ii) If the vectors in $\Gamma_t(S_t(\ell_t, j))$ are linearly independent, then we can choose a local encoding matrix $H^{(t)}$ in stage $t$ such that the vectors in $\Gamma_{t+1}(C_{\ell_t j})$ are linearly independent. In fact, we can use the $(n - 2) \times (n - 1)$ matrix $H$ whose $(r, s)$-entry is

$$H(r, s) = \begin{cases} 1 & \text{if } s = r \text{ or } s = r + 1, \\ 0 & \text{otherwise,} \end{cases}$$

as the linear-combining matrix $H^{(t)}$ at the newcomer in all stages $t$ and for all nodes being repaired. For example, if $n = 5$, we have

$$H^{(t)}_i = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}.$$

This matrix works because, after eliminating any one column, the resulting square matrix has full rank.

This proves that, for each $j$, we can choose $H^{(t)}$ such that a data collector in the next stage can decode the original data file from the packets in nodes $j$ and $\ell_t$. A standard application of the Schwartz-Zippel lemma in network coding theory [7] shows that there is a choice of $H^{(t)}$ such that all data collectors in the next stage are satisfied, if the finite field size is large enough. This gives the following lemma.
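The column-deletion property of the 0/1 combining matrix $H$ used in observation (ii) can be checked numerically. The following is a minimal Python sketch and not part of the original construction; the helper names `rank_mod_p`, `combining_matrix` and `column_deletion_ok` are hypothetical, and the prime fields tested are chosen only for illustration.

```python
def rank_mod_p(rows, p):
    """Rank of an integer matrix over the prime field F_p (Gaussian elimination)."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    for c in range(len(m[0])):
        # Find a row at or below position `rank` with a nonzero entry in column c.
        pivot = next((r for r in range(rank, len(m)) if m[r][c]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][c], -1, p)  # modular inverse, Python 3.8+
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][c]:
                f = m[r][c]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def combining_matrix(n):
    """(n-2) x (n-1) matrix with H(r, s) = 1 iff s = r or s = r + 1."""
    return [[1 if s in (r, r + 1) else 0 for s in range(n - 1)]
            for r in range(n - 2)]


def column_deletion_ok(n, p):
    """True if deleting any single column of H leaves a full-rank square matrix."""
    H = combining_matrix(n)
    for c in range(n - 1):
        sub = [[row[s] for s in range(n - 1) if s != c] for row in H]
        if rank_mod_p(sub, p) != n - 2:
            return False
    return True


if __name__ == "__main__":
    for n in range(4, 9):
        print(n, [column_deletion_ok(n, p) for p in (2, 3, 7)])
```

Deleting a column leaves a block-triangular matrix with unit diagonal, so the check is expected to pass over every field, including $F_2$; the sketch only confirms this for the small cases it runs.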

Lemma 1. Suppose that node $\ell_t$ fails in stage $t$ and the choice function $\chi$ is given. If, for each $j \ne \ell_t$, the vectors in $\Gamma_t(S_t(\ell_t, j))$ are linearly independent, then we can choose $H^{(t)}$ such that the vectors in $\Gamma_{t+1}(C_{\ell_t j})$ are linearly independent for all $j \ne \ell_t$, provided that the underlying finite field is sufficiently large.

The intuition from Lemma 1 is as follows. The $(n, 2)$ recovery property per se requires that the global encoding vectors in any two storage nodes are linearly independent. However, in order to sustain the $(n, 2)$ recovery property through repair-by-transfer, we need to ensure the linear independence of many other sets of global encoding vectors. The next lemma serves as a tool for this purpose. Recall that, for a set $\mathcal{X}$ of indices, the notation $\Gamma_t(\mathcal{X})$ stands for the set of global encoding vectors corresponding to the indices in $\mathcal{X}$ in stage $t$.

Lemma 2. Suppose that node $\ell_t$ fails in stage $t$ and the packets with indices $R_t(\ell_t)$ are used to repair node $\ell_t$. Let $\mathcal{X}$ and $\mathcal{Y}$ be two sets of indices of global encoding vectors in stage $t$ and $t + 1$, respectively, such that

$$\mathcal{Y} \setminus \mathcal{X} \subseteq C_{\ell_t} \quad \text{and} \quad \mathcal{X} \setminus \mathcal{Y} \subseteq R_t(\ell_t).$$

If the vectors in $\Gamma_t(\mathcal{X})$ are linearly independent, then we can choose the local encoding matrix in stage $t$ such that the vectors in $\Gamma_{t+1}(\mathcal{Y})$ are linearly independent.

The proof of Lemma 2 is omitted.

IV. MAIN THEOREM

The choice function $\chi$ has to be chosen carefully; otherwise the $(n, 2)$ recovery property cannot be preserved for an unbounded number of stages. As a counter-example, suppose that in Example 1 we always pick the first packets in the memory of the surviving nodes in the repair process, i.e., the function $\chi(t, \ell, j)$ is identically equal to 1 in all stages. Then the dimension of the span of all encoding vectors in all nodes will eventually drop to four or less. As the dimension of the original data file is six, data recovery is doomed to fail.

The main theorem of this paper is that there is an appropriate choice function which preserves the $(n, 2)$ recovery property through repair-by-transfer.
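The counter-example with $\chi \equiv 1$ can be reproduced numerically before turning to the theorem. The sketch below is illustrative only and rests on several assumptions that are not in the text: the field is $F_{17}$, the stage-0 vectors are Vandermonde rows (so that any six of the fifteen are linearly independent), the nodes fail in round-robin order, and the fixed 0/1 matrix $H$ from Section III is used at every newcomer. The names `simulate_first_packet_repair` and `rank_mod_p` are hypothetical.

```python
P = 17  # assumed prime field size; the 15 evaluation points 1..15 are distinct mod 17


def rank_mod_p(rows, p=P):
    """Rank of an integer matrix over the prime field F_p (Gaussian elimination)."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    for c in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][c]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][c], -1, p)  # modular inverse, Python 3.8+
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][c]:
                f = m[r][c]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def simulate_first_packet_repair(n=5, stages=5):
    """Repair with chi identically 1; return the total rank after each stage."""
    alpha, B = n - 2, 2 * (n - 2)
    # Stage 0: Vandermonde rows (1, x, ..., x^(B-1)) at n*alpha distinct points.
    vecs = [[pow(x, e, P) for e in range(B)] for x in range(1, n * alpha + 1)]
    nodes = [vecs[i * alpha:(i + 1) * alpha] for i in range(n)]
    # Fixed 0/1 combining matrix: H(r, s) = 1 iff s = r or s = r + 1.
    H = [[1 if s in (r, r + 1) else 0 for s in range(n - 1)]
         for r in range(alpha)]
    ranks = [rank_mod_p([v for node in nodes for v in node])]
    for t in range(stages):
        failed = t % n  # round-robin failures: no node fails twice in a row
        # Every helper sends its first packet (chi = 1).
        received = [nodes[j][0] for j in range(n) if j != failed]
        nodes[failed] = [
            [sum(h * v[c] for h, v in zip(row, received)) % P for c in range(B)]
            for row in H
        ]
        ranks.append(rank_mod_p([v for node in nodes for v in node]))
    return ranks


if __name__ == "__main__":
    print(simulate_first_packet_repair())
```

Once every node has been repaired at least once, all stored packets lie in the span of the four first packets used in the very first repair, so the total rank cannot exceed four, which is below the six dimensions needed to recover the file.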
Theorem 3. For all $n \ge 4$, there exists an $(n, n-1, 2)$ MDS repair-by-transfer regenerating code for functional repair meeting the lower bound on repair-bandwidth in (1). Moreover, the finite field size can be chosen to be any prime power larger than $\binom{n(n-2)}{2(n-2)}$.

We will only sketch the proof of Theorem 3 in this section and go through an example for illustration. The idea of the proof is to come up with a choice function $\chi(t, i, j)$ and local encoding matrices $H^{(t)}_i$ such that there are sufficiently many sets of linearly independent global encoding vectors in every stage. We let $B = k(d + 1 - k) = 2(n - 2)$, $\alpha = n - 2$ and $\beta = 1$, by normalizing the unit if necessary.

In the proof of Theorem 3, we assume that no node fails in two consecutive stages, because if, say, node $i$ fails in stages $t$ and $t + 1$, we can repeat the repair procedure taken in stage $t$.

It is well known that, under the condition $q + 1 \ge n(n - 2)$, we can construct $n(n - 2)$ vectors of length $B$ with components drawn from $F_q$ such that every subset of $B$ vectors is linearly independent over $F_q$ [8]. (These vectors form a generator matrix of a maximal-distance separable (MDS) code of dimension $B$ over $F_q$.) In stage 0, we take these $n(n - 2)$ vectors as the initial global encoding vectors. Since any $B$ of them are linearly independent by construction, the global encoding vectors in any two storage nodes are linearly independent. Thus the $(n, 2)$ recovery property holds in the initial stage.

We display a choice function by a sequence of $n \times (n - 2)$ arrays. For each $t \ge 0$, we create an $n \times (n - 2)$ array and write $i$ in the $r$-th row and the $\chi(t, i, r)$-th column. The array associated with stage $t$ is interpreted as follows: if node $i$ fails in stage $t$, then for $r \ne i$, node $r$ sends its $j$-th packet to node $i$ if the entry in the $r$-th row and $j$-th column of the array in stage $t$ is labeled by $i$.

An example for $n = 5$ is shown in Fig. 2. In stage 0, we initialize $\chi(0, i, j) = 1$ for all $i$ and $j$, i.e., no matter which node fails, the surviving nodes send their first packets to the newcomer. In the $r$-th row, we write the
elements of $\{1, 2, \ldots, 5\} \setminus \{r\}$ in the column labeled Packet 1.

Suppose that node 1 fails in stage 0, i.e., $\ell_0 = 1$. The newcomer receives packets with global encoding vectors $\Gamma_0(2, 1)$, $\Gamma_0(3, 1)$, $\Gamma_0(4, 1)$ and $\Gamma_0(5, 1)$. The indices of these four packets are

$$R_0(1) = \{(2, 1), (3, 1), (4, 1), (5, 1)\}.$$

The packets in the repaired node 1 are linear combinations of these four packets.

Next we specify the choice function for stage $t = 1$. From the hypothesis that no node fails in two consecutive stages, node 1 will not fail in stage 1, so the $5 \times 3$ array associated with stage 1 does not contain the label 1. We design the choice function by spreading the labels 2 to 5 evenly in the array. In rows 2 to 5, each entry in the array contains exactly one label. Each packet is potentially used in the repair process after stage 1, depending on which node fails in stage 1. This evenly-spread property is expressed mathematically as

Property (a): For $j \ne 1$, $\chi(t, i, j) = \chi(t, i', j)$ only if $i = i'$.

(Here $t = 1$ and node 1 is the node repaired in the previous stage; in a general stage $t$ the excluded row is that of $\ell_{t-1}$.)

Following the discussion at the beginning of this section, we try to avoid using the same packet in the repair process of two consecutive stages. This motivates the second heuristic in the design of the choice function. Since the first packets in nodes 2 to 5 have been used in the repair process before stage 1, we impose the requirement that at most one of these four packets is used again in the repair process after stage 1. As there are four possible node failures in stage 1, this implies that exactly one of these four packets is used in the repair of the node failure in stage 1. Indeed, the choice function at stage 1 satisfies this requirement: the four entries under Packet 1 corresponding to nodes 2 to 5 are distinct.

Suppose node 2 fails in stage 1. Nodes 1 and 5 send the first packets in their memory to the newcomer, while nodes 3 and 4 send the last packets in their memory to the newcomer. The indices of these four packets are

$$R_1(2) = \{(1, 1), (3, 3), (4, 3), (5, 1)\}.$$

We note that only the first packet in node 5 is used twice in the repair
after stage 0 and stage 1. In set-theoretic terms, we can write $R_0(1) \cap R_1(2) = \{(5, 1)\}$.

            Stage 0                  Stage 1                  Stage 2                  Stage 3
            Pkt 1  Pkt 2  Pkt 3      Pkt 1  Pkt 2  Pkt 3      Pkt 1  Pkt 2  Pkt 3      Pkt 1  Pkt 2  Pkt 3
  Node 1    2345   -      -          2,5    3      4          3      4      5          2      5      3
  Node 2    1345   -      -          3      4      5          1      3      4,5        3      5      1
  Node 3    1245   -      -          4      5      2          5      4      1          5      2      1
  Node 4    1235   -      -          5      3      2          1      3      5          1,2    3      5
  Node 5    1234   -      -          2      3      4          4      1      3          3      1      2

Fig. 2. An example of choice functions satisfying Properties (a) and (b). Node 1 fails in stage 0, node 2 fails in stage 1, and node 4 fails in stage 2 ($\ell_0 = 1$, $\ell_1 = 2$ and $\ell_2 = 4$).

The choice function for $t = 2$ is illustrated in the array under Stage 2 in Fig. 2. We note that the choice function in stage 2 satisfies Property (a), i.e., for $r \in \{1, 3, 4, 5\}$, each entry in row $r$ contains exactly one label. Also, the label 2 does not appear, because it is assumed that node 2 will not fail in stage 2. We can check that

$$R_1(2) \cap R_2(1) = \{(3, 3)\},$$
$$R_1(2) \cap R_2(3) = \{(1, 1)\},$$
$$R_1(2) \cap R_2(4) = \{(5, 1)\},$$
$$R_1(2) \cap R_2(5) = \{(4, 3)\}.$$

We formulate this heuristic as follows.

Property (b): $|R_t(\ell_t) \cap R_{t+1}(i)| \le 1$ for $t \ge 0$ and $i \ne \ell_t$.

We write $\mathcal{X} \xrightarrow{t} \mathcal{Y}$ if $\mathcal{X}$ and $\mathcal{Y}$ satisfy the condition in Lemma 2. Under the choice function in Fig. 2 with $\ell_0 = 1$ and $\ell_1 = 2$, we illustrate the proof idea with a chain of three index sets, one in each of stages 0, 1 and 2, each consisting of six global encoding vectors.

[Figure: three $5 \times 3$ arrays for Stage 0, Stage 1 and Stage 2, with asterisks marking the six indices of each set.]

In stage 2, it is required that the global encoding vectors in nodes 1 and 2 are linearly independent. By Lemma 1, this can be guaranteed if the global encoding vectors with indices

$$S_1(2, 1) = \{(1, 1), (1, 2), (1, 3), (3, 3), (4, 3), (5, 1)\}$$

in stage 1 are linearly independent, and this in turn can be achieved, as in Lemma 2, if the global encoding vectors with indices

$$\{(2, 1), (3, 1), (3, 3), (4, 1), (4, 3), (5, 1)\}$$

in stage 0 are linearly independent. We initialize the global encoding vectors in stage 0 in such a way that any six global encoding vectors are linearly independent. Hence, in stage 2, with a suitable choice of local encoding matrices, the six global encoding vectors in nodes 1 and 2 are linearly independent.

The main idea of the proof is that the above example can be generalized
to all $C_{ij}$ in all stages, provided that the choice function satisfies Properties (a) and (b) and some other technical conditions. The details of the proof are omitted due to space limitations.

V. CONCLUDING REMARKS

We give an existence proof of a family of MDS regenerating codes with the property that a failed node is repaired by transferring packets received at an earlier time. This can be regarded as a linear code with coefficients restricted to $\{0, 1\}$. In a recent work by Wang et al. [9], a construction of MDS exact-repair regenerating codes with optimal disk I/O cost for any $k$ is given. Their construction differs from the construction in this paper in being a vector linear regenerating code: the data sent from a surviving node to the newcomer is divided into many fragments, and each fragment is a linear combination of the source data ($\beta > 1$). The regenerating codes considered in this paper are scalar linear regenerating codes ($\beta = 1$). Another vector linear regenerating code, for repairing the systematic nodes only, can be found in [10].

ACKNOWLEDGEMENT

The authors would like to thank Raymond Yeung for the useful discussions.

REFERENCES

[1] K. W. Shum and Y. Hu, "Repair-by-transfer in distributed storage system," in Information Theory and Applications Workshop, San Diego, Feb. 2012.
[2] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539-4551, Sep. 2010.
[3] S. El Rouayheb and K. Ramchandran, "Fractional repetition codes for repair in distributed storage systems," in Allerton Conference on Commun., Control and Computing, Monticello, Sep. 2010.
[4] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Distributed storage codes with repair-by-transfer and non-achievability of interior points on the storage-bandwidth tradeoff," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1837-1852, Mar. 2012.
[5] Y. Hu, H. C. H. Chen, P. P. C. Lee, and Y. Tang, "NCCloud: Applying network coding for the storage repair in a cloud-of-clouds," in Proc. of the 10th USENIX Conf. on File and Storage Tech. (FAST '12), San Jose, Feb. 2012.
[6] Y. Wu, "Existence and construction of capacity-achieving network codes for distributed storage," IEEE J. on Selected Areas in Commun., vol. 28, no. 2, pp. 277-288, Feb. 2010.
[7] R. Kötter and M. Médard, "An algebraic approach to network coding," IEEE/ACM Trans. on Networking, vol. 11, no. 5, pp. 782-795, Oct. 2003.
[8] R. M. Roth, Introduction to Coding Theory. Cambridge: Cambridge University Press, 2006.
[9] Z. Wang, I. Tamo, and J. Bruck, "On codes for optimal rebuilding access," in Proc. of the Allerton Conference, Monticello, Sep. 2011, pp. 1374-1381.
[10] V. R. Cadambe, C. Huang, J. Li, and S. Mehrotra, "Polynomial length MDS codes with optimal repair in distributed storage," in the 45th Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Nov. 2011, pp. 1850-1854.