The quantity of digital information stored by digital storage systems, be it data, photos or videos, is ever increasing. Today, a multitude of digital devices are interconnected via networks such as the Internet, and distributed systems for data storage, such as P2P (Peer-to-Peer) networks and cloud data storage services, have become an interesting alternative to centralized data storage. Even common user devices, such as home PCs or home access gateways, can be used as storage devices in a distributed data storage system. However, one of the most important problems that arise when using a distributed data storage system is its reliability. In a distributed data storage system where storage devices are interconnected via an unreliable network such as the Internet, connections to data storage devices can be temporarily or permanently lost for many different reasons, such as device disconnection due to voluntary powering off or an involuntary power surge, entry into standby mode due to prolonged inactivity, connection failure, access right denial, or even physical failure. Solutions must therefore be found for the large-scale deployment of fast and reliable distributed storage systems. According to prior art, the data to be stored is protected by devices and methods that add redundant data. This redundant data is created either by mere data replication, i.e. by storing simple copies of the data, or, for better storage efficiency, by storing the original data in a form that adds redundancy, for example through application of Reed-Solomon (RS) codes or other types of erasure correcting codes. To protect the distributed data storage against irremediable data loss, it is then essential that the quantity of redundant data in the distributed data storage system remains sufficient at all times to cope with the expected loss rate. As failures occur, some redundancy disappears. 
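As a rough illustration of the storage efficiency argument, the following sketch compares replication with an (n, k) maximum distance separable (MDS) code such as Reed-Solomon, for which any k of the n stored blocks suffice to recover the data. The `Scheme` class, the helper functions and the parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scheme:
    name: str
    n: int  # number of blocks stored on distinct devices
    k: int  # number of blocks needed to recover the data

    def overhead(self) -> float:
        """Total stored size divided by the original data size."""
        return self.n / self.k

    def tolerated_losses(self) -> int:
        """Device losses survivable when any k of the n blocks suffice."""
        return self.n - self.k


def replication(copies: int) -> Scheme:
    # Each copy holds the whole data, so a single surviving copy is enough (k = 1).
    return Scheme(f"{copies}-way replication", n=copies, k=1)


def mds_code(n: int, k: int) -> Scheme:
    # Data split into k blocks and encoded into n blocks (e.g. Reed-Solomon).
    return Scheme(f"({n},{k}) MDS code", n=n, k=k)


for s in (replication(3), mds_code(6, 4), mds_code(14, 10)):
    print(f"{s.name}: overhead {s.overhead():.2f}x, "
          f"tolerates {s.tolerated_losses()} device losses")
```

Under these assumptions, 3-way replication and a (6,4) code both survive two device losses, but the code stores 1.5 times the data size instead of 3 times.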
In particular, if a certain quantity of redundant data is lost, it is regenerated in due time, in a self-healing manner, so that the redundancy remains sufficient. In a first phase, the self-healing mechanism monitors the distributed data storage system to detect device failures. In a second phase, the system triggers regeneration of the lost redundant data on a set of spare devices. The lost redundancy is regenerated from the remaining redundancy. However, when the redundant data is based on erasure correcting codes, its regeneration is known to induce a high repair cost, i.e. a large communication overhead: a whole item of information, such as a file, must be downloaded and decoded (decoding being the application of a set of computational operations) in order to regenerate the lost redundancy. This repair cost can be reduced significantly when the redundant data is based on so-called regenerating codes, which stem from network information theory; regenerating codes allow lost redundancy to be regenerated without decoding.
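To make the repair cost of classical erasure codes concrete, here is a minimal sketch of a toy systematic Reed-Solomon-style code over the prime field GF(257), with one symbol per block (real systems apply the same arithmetic symbol by symbol over large blocks). The field size, the function names and the data values are assumptions for illustration. Repairing a single lost block requires downloading k blocks, i.e. as much data as the whole file, and decoding it.

```python
# Toy systematic Reed-Solomon-style code over the prime field GF(P).
# Block x stores f(x), where f is the unique polynomial of degree < k with
# f(i) = data[i] for i < k, so the first k blocks hold the data itself.
P = 257


def interpolate_at(points, x):
    """Lagrange-interpolate the polynomial through `points` (mod P), evaluate at x."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^-1 via Fermat
    return total


def encode(data, n):
    """Return n coded blocks; any k = len(data) of them determine the data."""
    points = list(enumerate(data))
    return [interpolate_at(points, x) for x in range(n)]


def repair(blocks, lost, k):
    """Classical erasure-code repair of one lost block.

    Downloads k surviving blocks, decodes (interpolates) them and re-encodes
    only the lost block. Returns (regenerated block, number of blocks downloaded).
    """
    helpers = [(x, b) for x, b in enumerate(blocks) if x != lost][:k]
    return interpolate_at(helpers, lost), len(helpers)


data = [10, 20, 30, 40]      # k = 4 data symbols
blocks = encode(data, n=6)   # (6, 4) code: tolerates 2 lost blocks
rebuilt, downloaded = repair(blocks, lost=5, k=len(data))
print(rebuilt == blocks[5], downloaded)  # True 4
```

Rebuilding one block of size M/k costs k downloaded blocks, i.e. M, which is the communication overhead that regenerating codes aim to reduce.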
Lower bounds on the repair cost, expressed as tradeoffs between storage and repair cost, have been established both for the single failure case and for the multiple failures case. The two extreme points of the tradeoff are Minimum Bandwidth (MBR for single failures, MBCR for multiple failures), which minimizes the repair cost first, and Minimum Storage (MSR for single failures, MSCR for multiple failures), which minimizes storage first. Codes matching these theoretical tradeoffs can be built using non-deterministic schemes such as random linear network codes.
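For reference, the extreme points are commonly expressed as follows in the regenerating-code literature; these expressions are not stated in this document and are given here as background under the usual model: a file of size M, storage α per node, repair bandwidth γ per repaired node, k nodes contacted for decoding, d nodes contacted for repair, and t failures repaired jointly (t = 1 for the single failure case).

```latex
\begin{aligned}
&\text{MSR } (t = 1):  && \alpha = \frac{M}{k}, && \gamma = \frac{M}{k}\cdot\frac{d}{d-k+1}\\
&\text{MBR } (t = 1):  && \alpha = \gamma = \frac{M}{k}\cdot\frac{2d}{2d-k+1} && \\
&\text{MSCR}:          && \alpha = \frac{M}{k}, && \gamma = \frac{M}{k}\cdot\frac{d+t-1}{d-k+t}\\
&\text{MBCR}:          && \alpha = \gamma = \frac{M}{k}\cdot\frac{2d+t-1}{2d-k+t} &&
\end{aligned}
```

Setting t = 1 in the MSCR and MBCR expressions recovers the MSR and MBR points.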
However, non-deterministic schemes for regenerating codes have the following drawbacks: they (i) require homomorphic hash functions to provide basic security (integrity checking), (ii) cannot be turned into systematic codes, i.e. codes offering access to the data without decoding (i.e. without additional computational operations), and (iii) provide only probabilistic guarantees. Deterministic schemes are of interest if they offer both a systematic form (i.e. the data can be accessed without decoding) and exact repair (during a repair, the regenerated block is equal to the lost block, and not merely equivalent to it). Exact repair is a more constraining problem than non-deterministic repair, which means that the existence of non-deterministic schemes does not imply the existence of schemes with exact repair.
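Drawback (iii) can be illustrated with a simplified model, which is an assumption of this sketch rather than part of the document: each of the k coded blocks fetched for decoding carries an independent, uniformly random coefficient vector over GF(q), and decoding succeeds only if the k x k coefficient matrix is invertible. The function names are hypothetical. The exact success probability grows with q but never reaches 1.

```python
# Simplified model (assumption): k fetched blocks carry independent, uniformly
# random coefficient vectors over GF(q); decoding succeeds iff the k x k
# coefficient matrix is invertible.
import random
from fractions import Fraction


def p_decodable(k, q):
    """Exact probability that a uniformly random k x k matrix over GF(q) is invertible."""
    p = Fraction(1)
    for i in range(1, k + 1):
        p *= 1 - Fraction(1, q**i)
    return p


def rank_mod(rows, p):
    """Rank of an integer matrix over the prime field GF(p) (Gauss-Jordan elimination)."""
    m = [[v % p for v in row] for row in rows]
    rank = 0
    for c in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][c]), None)
        if pivot is None:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][c], p - 2, p)  # modular inverse, p prime
        m[rank] = [v * inv % p for v in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][c]:
                f = m[r][c]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def simulate(k, p, trials, seed=0):
    """Monte Carlo estimate of the decoding success rate with random coefficients."""
    rng = random.Random(seed)
    ok = sum(
        rank_mod([[rng.randrange(p) for _ in range(k)] for _ in range(k)], p) == k
        for _ in range(trials)
    )
    return ok / trials


for q in (2, 257):
    print(q, float(p_decodable(4, q)), simulate(4, q, trials=2000))
```

For k = 4 over GF(2) the guarantee is only 315/1024; over GF(257) it is close to, but still below, 1. A deterministic code, by contrast, guarantees decodability by construction.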
For the single failure case, code constructions with exact repair have been given for both the MSR point and the MBR point. However, the existence of codes supporting exact repair of multiple failures, referred to hereinafter as exact coordinated/adaptive regenerating codes, is still an open question. Prior art addresses the case of single failures and a restricted case of multiple failure repair, in which the data is split into several independent codes and each code is repaired independently, using a classical repair method for erasure correcting codes. This case is known as d=k, d being the number of nodes contacted during repair and k being the number of nodes contacted when decoding. This method does not reduce the repair cost, in terms of the number of bits transferred over the network, compared to classical erasure correcting codes.
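The absence of gain at d=k can be checked numerically under two assumptions not stated in the document: the MSCR repair bandwidth expression from the coordinated regenerating code literature, and a classical repair strategy in which one new node downloads k blocks (the whole file), decodes, re-encodes the t lost blocks and forwards t-1 of them to the other new nodes. Both function names are hypothetical.

```python
from fractions import Fraction


def mscr_total_traffic(M, k, d, t):
    """Total repair traffic for t jointly repaired nodes at the MSCR point:
    t * gamma, with gamma = (M/k) * (d + t - 1) / (d - k + t)."""
    return t * Fraction(M, k) * Fraction(d + t - 1, d - k + t)


def classical_total_traffic(M, k, t):
    """Assumed classical erasure-code repair: one node downloads k blocks of
    size M/k, decodes, re-encodes the t lost blocks and forwards t - 1 of them."""
    return k * Fraction(M, k) + (t - 1) * Fraction(M, k)


M, k = 1, 4  # file size normalized to 1
for t in (1, 2, 3):
    print(t, mscr_total_traffic(M, k, d=k, t=t), classical_total_traffic(M, k, t))
```

With d=k both quantities reduce to M(k+t-1)/k (here 1, 5/4 and 3/2 times the file size for t = 1, 2, 3), so the regenerating-code bound offers no saving over classical repair in this case.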
Document “Exact minimum repair bandwidth cooperative regenerating codes for distributed storage systems”, Proceedings of the 2011 IEEE International Symposium on Information Theory, is limited to the above-discussed case d=k, because the method it describes is not powerful enough to allow d>k. Having d>k reduces the repair cost in terms of the amount of data to be transmitted in the network between nodes: more nodes are contacted, but less data is exchanged overall. The relation is not linear; the total amount of data exchanged decreases as more nodes are contacted, but not in proportion to the number of nodes contacted. Because of this limitation to the case d=k, the method described in the document cannot take full advantage of regenerating codes, and its repair costs remain equivalent to those observed for systems using erasure correcting codes such as RS (Reed-Solomon) codes. Moreover, the method described in the document does not use network coding, i.e. the generation of new data blocks from encoded data without a decoding/encoding step.
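The effect of d>k can be sketched with the MSCR repair bandwidth expression from the coordinated regenerating code literature; this expression is an assumption here, not taken from the document, and `mscr_gamma` is a hypothetical helper. The loop tabulates the total repair traffic for two jointly repaired failures as d grows beyond k.

```python
from fractions import Fraction


def mscr_gamma(M, k, d, t):
    """Per-node repair bandwidth at the MSCR point (each node stores alpha = M/k)."""
    if d < k:
        raise ValueError("regenerating codes require d >= k")
    return Fraction(M, k) * Fraction(d + t - 1, d - k + t)


M, k, t = 1, 4, 2          # file size 1, k = 4, two failures repaired jointly
previous = None
for d in range(k, k + 7):  # assumes the system holds at least d + t nodes
    total = t * mscr_gamma(M, k, d, t)
    gain = "" if previous is None else f"  (saves {float(previous - total):.3f})"
    print(f"d={d}: total traffic {float(total):.3f} x file size{gain}")
    previous = total
```

The total traffic drops from 1.25 times the file size at d=k=4 to 0.75 at d=8, while the saving brought by each additional contacted node shrinks; as d grows, the total approaches t·M/k, i.e. 0.5 here.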
Thus, prior art solutions for the regeneration of redundant data in distributed storage systems based on exact regenerating codes can still be optimized with regard to the exact repair of multiple failures. Such an optimization is of interest for distributed data storage systems that require a high level of data storage reliability while keeping the repair cost as low as possible.