With the rapid development of computer networks, the volume of network data keeps growing, which makes the storage of massive data particularly important. Traditional file storage systems can no longer satisfy the demands of existing applications for high capacity, high reliability and high performance, and distributed storage systems have proved effective because of their scalability and high availability. However, the storage nodes in a distributed storage system are unreliable, so redundancy must be introduced in order to provide a reliable service on top of unreliable storage nodes. The simplest form of redundancy is a direct back-up (replication) of the original data. Although direct back-up is simple, its storage efficiency and system reliability are low, whereas introducing redundancy through coding improves the storage efficiency. Current storage systems generally adopt MDS (Maximum Distance Separable) codes, which achieve optimal storage efficiency. An (n, k) MDS error-correcting code splits the original file into k pieces of equal size and generates n mutually independent encoded pieces through linear coding; these are stored in n storage nodes, one per node, and satisfy the MDS property (any k encoded pieces are sufficient to reconstruct the original file). This coding technique plays an important role in providing efficient redundancy for network storage, and is particularly suitable for storing large files and for data back-up applications.
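The MDS property described above can be illustrated with a minimal sketch. It assumes the simplest possible case, a (3, 2) single-parity XOR code, which is chosen here only for illustration and is not the code of any system discussed in this document; the function names are hypothetical.

```python
from itertools import combinations


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> list[bytes]:
    """Split data into k = 2 equal pieces and add one XOR parity piece (n = 3)."""
    if len(data) % 2:
        raise ValueError("this toy code expects an even-length input")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, xor_bytes(a, b)]


def reconstruct(pieces: dict[int, bytes]) -> bytes:
    """Rebuild the file from any 2 of the 3 pieces, keyed by node index 0..2."""
    if len(pieces) < 2:
        raise ValueError("need at least k = 2 pieces")
    if 0 in pieces and 1 in pieces:      # both data pieces survive
        a, b = pieces[0], pieces[1]
    elif 0 in pieces:                    # first data piece and parity survive
        a = pieces[0]
        b = xor_bytes(a, pieces[2])
    else:                                # second data piece and parity survive
        b = pieces[1]
        a = xor_bytes(b, pieces[2])
    return a + b


original = b"distributed!"
stored = encode(original)
# Try every choice of k = 2 nodes out of n = 3.
recovered = [
    reconstruct({i: stored[i], j: stored[j]})
    for i, j in combinations(range(3), 2)
]
```

Every pair of nodes yields the original file, which is exactly the "any k of n" condition; practical MDS codes such as RS codes achieve the same property for general n and k over a larger finite field.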
In a distributed storage system, a file of size B is stored in n storage nodes, each of which stores data of size α. A data receiver only needs to connect to any k of the n storage nodes and download the data stored there in order to recover the original file. This procedure is called reconstruction. The RS (Reed-Solomon) code is a code that satisfies the MDS property. When a storage node fails, a new node must be regenerated to store the lost data in order to maintain the redundancy level. This procedure is called repair. During repair, however, an RS code first has to reconstruct the original file from the data of k storage nodes and only then recover the lost data of the failed node for the new node. Recovering the whole file in order to recover a single node is clearly a waste of bandwidth.
Storage node failures and file losses gradually reduce the redundancy over time, so a mechanism is needed to maintain the system redundancy. EC (Erasure Codes), as described in the paper [R. Rodrigues and B. Liskov, “High Availability in DHTs: Erasure Coding vs. Replication”, Workshop on Peer-to-Peer Systems (IPTPS) 2005.], are relatively efficient in storage overhead, but they require a rather large communication overhead to restore the redundancy. FIG. 1 shows that the original file can be obtained as long as there are d (d≧k) available nodes in the system. FIG. 2 shows the process of recovering the data stored in a failed node. As can be seen from FIG. 1 and FIG. 2, the entire recovery process is: 1) download data from k storage nodes to reconstruct the original file; 2) re-encode the original file to generate the new fragment and store it in the new node. This recovery process shows that the network load required to repair any failed node is at least the amount of data stored in k nodes.
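The cost of this two-step recovery can be made concrete with a small calculation. The sketch below assumes that each node stores exactly 1/k of the file, as in an MDS code; the function name and the example numbers are hypothetical.

```python
from fractions import Fraction


def naive_repair_traffic(file_size: int, k: int) -> dict[str, Fraction]:
    """Traffic needed to repair ONE lost fragment of an (n, k) erasure code
    when the whole file is first reconstructed from k surviving nodes."""
    fragment = Fraction(file_size, k)   # amount of data stored on each node
    downloaded = k * fragment           # step 1: fetch k fragments (the whole file)
    return {
        "fragment_size": fragment,
        "downloaded": downloaded,
        "overhead_factor": downloaded / fragment,   # always equals k
    }


cost = naive_repair_traffic(file_size=1024, k=4)
```

With a 1024-unit file and k = 4, repairing a single 256-unit fragment requires downloading all 1024 units, i.e. k times the amount of data that was actually lost.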
Meanwhile, in order to reduce the bandwidth used in the repair process, the paper [A. G. Dimakis, P. G. Godfrey, M. J. Wainwright, K. Ramchandran, “Network coding for distributed storage systems”, IEEE Proc. INFOCOM, Anchorage, Ak., May 2007.] proposed RGC (Regenerating Codes) on the basis of network coding theory. RGC codes also satisfy the MDS property. During the repair process of an RGC code, a new storage node connects to d of the remaining nodes and downloads β units of data from each of them, so the repair bandwidth of RGC is dβ. The same paper gives the functional repair model of RGC and proposes two optimal codes: MSR (Minimum-Storage Regenerating) codes and MBR (Minimum-Bandwidth Regenerating) codes. RGC codes are superior to RS codes in terms of repair bandwidth, but the RGC repair process needs to connect to d (d>k) storage nodes (called repair nodes). Moreover, each repair node needs to apply random linear network coding to its stored data. In order to guarantee the independence of all the coded packets, RGC codes have to operate over a relatively large Galois field GF(q).
The patent PCT/CN2012/083174 proposes a practical projective self-repairing code together with its coding scheme and its data reconstruction and repair methods. PPSRC (Practical Projective Self-repairing Codes) also have the two typical properties of self-repairing codes: a lost coded packet can be repaired by downloading less data than the entire file from other surviving coded packets; and a lost packet can be repaired from a given number of fragments, where that number depends only on how many fragments are lost and not on which ones are lost. These properties keep the repair load relatively low. In addition, all nodes in the system have the same status, so different lost packets can be repaired independently and concurrently at different places in the network.
In addition to the above properties, PPSRC codes have the following properties: when a node fails, there are (n−1)/2 pairs of repair nodes to choose from; and if (n−1)/2 nodes fail at the same time, the failed nodes can still be repaired using any two of the remaining (n+1)/2 nodes.
The encoding and self-repair processes of PPSRC codes involve only XOR operations, whereas the general PSRC (Projective Self-repairing Codes) are relatively complex because they use polynomial operations; that is, the computational complexity of PPSRC is lower than that of PSRC. PPSRC is also superior to PSRC in repair bandwidth and in the number of repair nodes. The redundancy of PPSRC is controllable, which makes it suitable for general storage systems. Moreover, PPSRC has the optimal reconstruction bandwidth.
In short, PPSRC effectively reduces the number of storage nodes and the system redundancy, which improves its practical value.
PPSRC codes, however, have some shortcomings. First, the encoding and decoding scheme of PPSRC is complex: the partition operations on the Galois field and its subfields are relatively heavy, and reconstruction is rather tedious. Second, in PPSRC codes the coded blocks are indivisible, so the repair blocks must be indivisible as well. Third, the entire encoding and decoding process of PPSRC has a high computational complexity, and the redundancy is in fact quite large even though it is controllable. In general, PPSRC uses a large number of storage nodes, which is unnecessary for small files. All of these increase the difficulty of implementation in practical distributed storage systems. Thus, PPSRC codes lack generality.
Hierarchical Codes (HC) were proposed in the paper [A. Duminuco, E. Biersack, “Hierarchical Codes: How to Make Erasure Codes Attractive for Peer-to-Peer Storage Systems”, Peer-to-Peer Computing (P2P), 2008.] on the basis of the following property: if any one of three fragments is obtained from the other two by an XOR operation, then any one of the three can be recovered from the other two. An HC code has an iterative structure: a large code is constructed step by step from small EC codes through XOR operations between the sub-modules of the EC codes.
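The XOR property that HC codes rely on can be checked directly. The following is only a sketch with arbitrary example bytes; the fragment names f1, f2 and f3 are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


f1 = b"\x12\x34\x56"
f2 = b"\xab\xcd\xef"
f3 = xor_bytes(f1, f2)          # the third fragment is the XOR of the other two

# Whichever single fragment is lost, the remaining two recover it.
rebuilt_f1 = xor_bytes(f2, f3)
rebuilt_f2 = xor_bytes(f1, f3)
rebuilt_f3 = xor_bytes(f1, f2)
```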
Its main idea is as follows. For a file of size s×k, the file is divided into s sub-blocks, each containing k original modules. In each sub-block, (n−k) local redundant coded modules are generated with an (n, k) EC code. The coding scheme then generates r global redundant coded modules from all the s×k original modules, so that the resulting coded group consists of (s×n+r) coded modules generated from s×k original modules. Local redundant modules can be used to repair failed nodes within their own sub-block, so that the repair requires less data than the whole file, while the global redundant modules provide a further level of repair: when so many modules have failed that a sub-block cannot repair itself, the failed modules can be repaired with the global redundant modules. The structure of HC codes is asymmetric, so some modules play a more important role than others. As a result, it is very difficult to analyze the resilience in depth (which hinders understanding of the effectiveness of the code); more complex algorithms are required to implement the coding scheme in actual systems (for both reconstruction and repair); because different modules have a different status, the number of modules required to repair failed modules depends not only on how many modules have failed but also on which ones have failed; and, similarly, the number of modules required to reconstruct the original file may vary with which modules have failed.
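The interplay of local and global redundancy can be sketched with a deliberately small layout. The example below assumes s = 2 sub-blocks, k = 2 original modules per sub-block, a single XOR parity as the local code (n = 3) and a single XOR parity over all original modules as the global code (r = 1). These choices and all names are illustrative assumptions, not the construction of the cited paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_all(blocks: list[bytes]) -> bytes:
    """XOR of a non-empty list of equal-length byte strings."""
    return reduce(xor_bytes, blocks)


def encode_hierarchical(sub_blocks: list[list[bytes]]) -> tuple[list[bytes], bytes]:
    """Return (local parities, global parity) for s sub-blocks of k modules each."""
    local = [xor_all(group) for group in sub_blocks]
    global_parity = xor_all([m for group in sub_blocks for m in group])
    return local, global_parity


# s = 2 sub-blocks, k = 2 original modules each (toy data).
sub_blocks = [[b"\x01\x02", b"\x03\x04"], [b"\x10\x20", b"\x30\x40"]]
local, global_parity = encode_hierarchical(sub_blocks)
total_modules = sum(len(g) for g in sub_blocks) + len(local) + 1   # s*n + r = 7

# Local repair: module [0][0] is lost; only k = 2 modules of its own
# sub-block are read (the other original module and the local parity).
local_fix = xor_bytes(sub_blocks[0][1], local[0])

# Global repair: module [0][0] AND local parity 0 are both lost, so the
# sub-block cannot repair itself; the global parity and all surviving
# original modules are read instead.
global_fix = xor_all([global_parity, sub_blocks[0][1], *sub_blocks[1]])
```

The two repairs read different numbers of modules (2 versus 4) for the same lost module, which illustrates why the repair cost of HC codes depends on which modules have failed and not only on how many.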
The patent PCT/CN2012/071177 proposes an RGC (Regenerating Codes) code, with which only a small fraction of the data is needed to repair a lost fragment, rather than reconstructing the entire file. RGC codes apply linear network coding and reduce the repair cost by exploiting the NC (Network Coding) property (i.e. max-flow min-cut). It can be proved in network information theory that a network overhead equal to the amount of data in the lost fragment is enough to repair the lost fragment.
RGC codes are still built on the MDS property. When some storage nodes fail, which is equivalent to data loss, information must be downloaded from the surviving nodes to repair the lost fragments, and the new data is stored in new nodes. As time goes on, many of the original nodes may fail, and some regenerated nodes may themselves take part in later repairs to generate further new nodes. Therefore, two points must be ensured during regeneration: 1) node failures are independent of each other, so that the regeneration process can be repeated; 2) any k nodes are sufficient to reconstruct the original file.
FIG. 2 describes the regeneration process after a node failure. Consider a distributed system with n storage nodes, where each node stores α packets. If a node fails, a new node downloads β packets from each of d (d≧k) surviving nodes for regeneration. Each storage node i is denoted by a pair of nodes $X_{in}^{i}$, $X_{out}^{i}$ connected by an edge whose capacity equals the amount of data stored at the node (i.e. α). The regeneration process is described by an information flow graph, in which $X_{in}$ collects β packets from each of d arbitrary surviving nodes and stores α packets in $X_{out}$ through the edge

$$X_{in} \overset{\alpha}{\longrightarrow} X_{out}.$$

$X_{out}$ can be accessed by any receiver. The max-flow between the source and the sink is determined by the min-cut of the graph, which must not be smaller than the size of the original file if the sink is to reconstruct the original file.
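The min-cut condition can be checked numerically. The sketch below assumes the lower bound that is usually stated for this information flow graph in the regenerating-codes literature, namely that the min-cut is at least the sum over i = 0, ..., k−1 of min((d−i)β, α); this expression is not given in the present text, and the symbol M for the file size and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def min_cut_lower_bound(alpha, beta, d: int, k: int):
    """Assumed bound: sum over i = 0..k-1 of min((d - i) * beta, alpha)."""
    return sum(min((d - i) * beta, alpha) for i in range(k))


def can_store(file_size, alpha, beta, d: int, k: int) -> bool:
    """True if the bound guarantees that any k nodes can rebuild the file."""
    return min_cut_lower_bound(alpha, beta, d, k) >= file_size


M, k, d = 12, 3, 4                       # hypothetical example parameters
alpha = Fraction(M, k)                   # 4 packets stored per node
beta = Fraction(M, k * (d - k + 1))      # 2 packets downloaded per helper node

feasible = can_store(M, alpha, beta, d, k)                   # bound equals 12
too_little = can_store(M, alpha, Fraction(1), d, k)          # bound equals 9
```

With β = 2 the bound equals the file size exactly, so the file can be reconstructed; lowering β to 1 drops the bound to 9, below the file size of 12.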
There is a trade-off between the storage α of each node and the bandwidth γ required to regenerate a node; therefore, MBR (Minimum Bandwidth Regenerating) codes and MSR (Minimum Storage Regenerating) codes are additionally introduced. Clearly, the minimum storage per node is M/k bits, where M denotes the size of the original file, from which it follows that for MSR codes

$$(\alpha_{MSR}, \gamma_{MSR}) = \left(\frac{M}{k}, \frac{Md}{k(d-k+1)}\right) \qquad (9)$$

When d reaches its maximum value, i.e. a newcomer connects to all the (n−1) surviving nodes, the repair bandwidth $\gamma_{MSR}$ reaches its minimum value

$$\gamma_{MSR}^{\min} = \frac{M}{k} \cdot \frac{n-1}{n-k} \qquad (10)$$

MBR codes have the minimum repair bandwidth, from which it can be concluded that RGC has the minimum repair overhead when d=n−1, i.e.

$$(\alpha_{MBR}^{\min}, \gamma_{MBR}^{\min}) = \left(\frac{M}{k} \cdot \frac{2n-2}{2n-k-1}, \frac{M}{k} \cdot \frac{2n-2}{2n-k-1}\right) \qquad (11)$$
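Equations (9) to (11) can be evaluated for a concrete case to see the trade-off. The sketch below uses exact fractions; the function names and the example values M = 12, n = 5, k = 3 are hypothetical.

```python
from fractions import Fraction


def msr_point(M: int, k: int, d: int) -> tuple[Fraction, Fraction]:
    """Equation (9): (alpha_MSR, gamma_MSR)."""
    return Fraction(M, k), Fraction(M * d, k * (d - k + 1))


def msr_min_bandwidth(M: int, n: int, k: int) -> Fraction:
    """Equation (10): minimum MSR repair bandwidth, reached at d = n - 1."""
    return Fraction(M, k) * Fraction(n - 1, n - k)


def mbr_min_point(M: int, n: int, k: int) -> tuple[Fraction, Fraction]:
    """Equation (11): (alpha_MBR, gamma_MBR) at d = n - 1; both are equal."""
    value = Fraction(M, k) * Fraction(2 * n - 2, 2 * n - k - 1)
    return value, value


M, n, k = 12, 5, 3
alpha_msr, gamma_msr = msr_point(M, k, d=n - 1)   # (4, 8)
gamma_msr_min = msr_min_bandwidth(M, n, k)        # 8, agrees with (9) at d = n - 1
alpha_mbr, gamma_mbr = mbr_min_point(M, n, k)     # (16/3, 16/3)
```

In this example an MSR node stores 4 units and needs 8 units of repair traffic, while an MBR node stores 16/3 units and needs only 16/3 units of repair traffic; both are below the 12 units that a repair by full reconstruction would download.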
There are three models for repairing failed nodes. Exact repair: the lost fragments must be regenerated exactly, i.e. the recovered data is identical to the lost data (the core technologies are interference alignment and NC). Functional repair: it is only required that the repaired system still satisfies the MDS property, even if the regenerated fragments contain data different from that of the failed nodes (the core technology is NC). Partially exact repair: a hybrid model between exact repair and functional repair, in which the systematic nodes (which store the un-coded data) must be regenerated exactly (i.e. the recovered data must be the same as the original information stored in the failed nodes), while the non-systematic nodes (which store the coded fragments) may be repaired functionally so as to satisfy the MDS property (the core technologies are interference alignment and NC).
To apply RGC codes in practical distributed systems, at least k nodes must be contacted to repair a failed node, even in the non-optimal case. Therefore, RGC codes incur a high protocol overhead and a high system design complexity (because of the NC technology), even though the required repair bandwidth is relatively low. In addition, RGC codes do not take engineering solutions such as lazy repair into consideration, and consequently cannot avoid the repair bandwidth caused by temporary failures. Finally, the NC-based encoding and decoding of RGC codes are computationally expensive, about an order of magnitude more so than those of traditional EC.