Network coding (NC), introduced in 2000, is a major advance on the channel capacity problem since C. E. Shannon published “A Mathematical Theory of Communication”. It addresses how multicast or broadcast from a single source or multiple sources to multiple receivers can reach the capacity limit of the network.
Traditionally, the routers and switches at network nodes perform only a store-and-forward function. NC shows that if these nodes are allowed to encode their incoming information flows before sending them on, the nodes perform both routing and encoding. Under this architecture, network throughput can reach the theoretical limit given by the max-flow problem.
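As a minimal sketch of encoding at an intermediate node, the following example assumes the classic butterfly topology, which the text above does not name: two sinks each receive one packet directly and one coded packet over a shared bottleneck link. The function names `relay_encode` and `sink_decode` and the packet values are hypothetical, and bytewise XOR is only the simplest possible coding operation.

```python
def relay_encode(a: bytes, b: bytes) -> bytes:
    """Intermediate node: combine two equal-length packets with bytewise XOR."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def sink_decode(known: bytes, coded: bytes) -> bytes:
    """Sink: XOR the directly received packet with the coded packet
    to recover the packet it did not receive directly."""
    return relay_encode(known, coded)


# Illustrative packets (assumed values).
packet_a = b"\x0f\xa0"
packet_b = b"\x33\x5c"

# The relay sends one coded packet over the bottleneck instead of two plain ones.
coded = relay_encode(packet_a, packet_b)

recovered_b = sink_decode(packet_a, coded)  # sink 1 already holds packet_a
recovered_a = sink_decode(packet_b, coded)  # sink 2 already holds packet_b
```

With plain store-and-forward, the bottleneck link could carry only one of the two packets per use, so one sink would have to wait; with the coded packet, both sinks obtain both packets in the same number of transmissions.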
As storage systems grow in scale, their failure rate also rises significantly, and there is increasing demand for higher fault tolerance. Existing techniques achieve reliability in distributed storage systems primarily through erasure coding. In RAID systems, by comparison, the commonly used RAID-5 products can recover from a single disk failure, and RAID-6, which can repair a double disk failure, is gradually coming into practice. RAID-5 tolerates a single disk failure by means of a parity check. To tolerate a double disk failure while keeping all aspects of performance optimized, RAID-6 needs a “special” erasure code.
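The single-failure parity check used by RAID-5 can be sketched as follows. This is a simplified illustration, not a RAID implementation: it treats one stripe as a list of equal-length byte blocks, ignores how parity is rotated across disks, and the names `xor_blocks`, `make_parity`, and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block of one stripe: the XOR of all its data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks):
    """XOR of every surviving block (data and parity) yields the single lost block."""
    return xor_blocks(surviving_blocks)


# One stripe with three data blocks (assumed contents).
stripe = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(stripe)

# Suppose the disk holding stripe[1] fails; rebuild it from the rest.
rebuilt = rebuild_missing([stripe[0], stripe[2], parity])
```

Because XOR can cancel only one unknown per byte position, this scheme cannot repair two simultaneous losses, which is why a double-failure system needs a different code.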
The constraints on erasure codes are relaxed to some extent in distributed systems; for example, Galois field operations can be used instead of XOR operations alone. In general, all nodes in a distributed system have the same status, so systematic coding is not strictly required. In addition, distributed systems are usually very large, which requires that the coding rate not decrease excessively as the scale increases. Reed-Solomon coding (MacWilliams and Sloane, 1977) is widely used in distributed systems, and its degree of redundancy can be set according to practical needs. Encoding and decoding are performed in a relatively large Galois field, so the overall computational cost is significantly higher than that of XOR operations. Reed-Solomon coding is built on polynomial theory, and its generator matrix can take many forms. A widely used technique takes a Vandermonde matrix as the generator matrix:
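$$
G = \left( I_n \;\middle|\;
\begin{matrix}
1 & 1 & 1 & \cdots & 1 \\
1 & 2 & 2^2 & \cdots & 2^{n-1} \\
1 & 3 & 3^2 & \cdots & 3^{n-1} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & n & n^2 & \cdots & n^{n-1}
\end{matrix}
\right)
$$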
Here, both m and n can be chosen flexibly as needed in order to achieve an arbitrary coding rate. Erasure codes use mature coding theory to split the original data into chunks and encode those chunks: the original data is divided into n parts, and m redundant blocks are generated from them. Any n of the total n+m blocks are sufficient to recover the original data. Some nodes store original data blocks while others store encoded redundant blocks, so their statuses and functions clearly differ, and a central node is sometimes needed for the encoding process in a distributed environment.
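The encode and recover steps can be sketched with a systematic generator matrix of the form (I | V). Several choices below are assumptions made to keep the example short: arithmetic is done in the prime field GF(257) rather than the GF(2^w) fields used by practical Reed-Solomon implementations, the Vandermonde part is taken as n rows by m columns with entry i^j (i = 1..n, j = 0..m-1), and all function names are hypothetical. Note also that an identity block joined to a Vandermonde block does not by itself guarantee that every choice of n blocks is decodable for all parameters; the solver raises an error if the chosen blocks are linearly dependent.

```python
P = 257  # prime modulus; an assumption for this sketch (needs Python 3.8+ for pow(x, -1, P))


def generator_matrix(n, m):
    """Return the n x (n+m) matrix (I_n | V) with V[i][j] = (i+1)**j mod P."""
    return [[1 if r == c else 0 for c in range(n)]
            + [pow(r + 1, j, P) for j in range(m)]
            for r in range(n)]


def encode(data, m):
    """Multiply the data row vector (n symbols in 0..P-1) by G to get n+m blocks."""
    n = len(data)
    G = generator_matrix(n, m)
    return [sum(data[r] * G[r][c] for r in range(n)) % P for c in range(n + m)]


def solve_mod(A, b):
    """Solve the square system A x = b over GF(P) by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col]), None)
        if pivot is None:
            raise ValueError("chosen blocks are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * c) % P for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def decode(survivors, n, m):
    """Recover the n data symbols from a dict {block_index: value} of n survivors."""
    G = generator_matrix(n, m)
    idx = sorted(survivors)[:n]
    # Surviving block k satisfies: sum over r of data[r] * G[r][k] = value_k.
    A = [[G[r][k] for r in range(n)] for k in idx]
    b = [survivors[k] for k in idx]
    return solve_mod(A, b)


data = [10, 200, 37]          # n = 3 data symbols (assumed values)
blocks = encode(data, 2)      # m = 2 redundant blocks, 5 blocks in total
# Lose blocks 0 and 2; recover from one data block and the two redundant blocks.
recovered = decode({1: blocks[1], 3: blocks[3], 4: blocks[4]}, 3, 2)
```

Because the code is systematic, the first n blocks are the data itself, which mirrors the distinction between nodes that store original blocks and nodes that store redundant blocks.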
Subsequently, a random linear coding scheme was proposed for distributed storage. It also stores data in a distributed manner: the entire file is divided into several blocks, and each encoded block is a random linear combination of all the blocks. However, the coding vector of every encoded block must also be stored, and a lost file is recovered by collecting the coding vectors and encoded blocks from other nodes. This increases the storage overhead and data processing load of each node, as well as the communication bandwidth consumed when a node is repaired.
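The following sketch illustrates random linear coding and why the coding vectors must be kept alongside the encoded blocks. It is an illustration under stated assumptions, not the scheme of any particular system: symbols live in the prime field GF(257), blocks are lists of integers, and the names `rlc_encode` and `rlc_decode` are hypothetical.

```python
import random

P = 257  # prime modulus; an assumption for this sketch (needs Python 3.8+ for pow(x, -1, P))


def rlc_encode(blocks, rng):
    """Return (coding_vector, coded_block): a random linear combination of all blocks."""
    coeffs = [rng.randrange(P) for _ in blocks]
    length = len(blocks[0])
    coded = [sum(c * blk[i] for c, blk in zip(coeffs, blocks)) % P
             for i in range(length)]
    return coeffs, coded


def rlc_decode(received, n):
    """Recover the n original blocks from a list of (coding_vector, coded_block)
    pairs by Gauss-Jordan elimination on the augmented matrix [vectors | blocks]."""
    M = [list(vec) + list(blk) for vec, blk in received]
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, len(M)) if M[r][col]), None)
        if pivot is None:
            raise ValueError("coding vectors do not span the file; fetch more blocks")
        M[row], M[pivot] = M[pivot], M[row]
        inv = pow(M[row][col], -1, P)
        M[row] = [v * inv % P for v in M[row]]
        for r in range(len(M)):
            if r != row and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % P for a, b in zip(M[r], M[row])]
        row += 1
    return [M[r][n:] for r in range(n)]


rng = random.Random(0)                            # fixed seed for repeatability
file_blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # file split into 3 blocks (assumed values)
stored = [rlc_encode(file_blocks, rng) for _ in range(5)]  # 5 coded blocks on 5 nodes
restored = rlc_decode(stored, 3)
```

Each stored item carries n extra coefficients in addition to the coded block, and decoding requires solving a linear system, which is the storage and processing overhead described above.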
In summary, existing data storage methods cannot guarantee the reliability of a distributed network storage system.