Distributed data-storage systems (“DSSs”) are complicated software solutions that operate over hardware infrastructures consisting of a large number of servers with various roles that are connected together by multiple communication channels. Existing DSSs are subject to various kinds of hardware failures, including total or partial loss of electrical power, network failures that may split the DSS into separate unconnected segments, disk failures, and the like.
Conventionally, there are two main approaches to ensuring reliability of data storage in conditions of failure, based on the duplication of information and the spreading of data over different components of the distributed data-storage system. The first approach is data replication and the second approach is erasure coding.
In general, data replication is the storage of each block of data (i.e., file or object, depending on the architecture of the data-storage system) in several copies on different disks or different nodes of the system. As a result, replication makes it possible to ensure maximum efficiency of data access, including a significant increase in speed of read access to data frequently used by different clients. However, data replication can be very costly from the perspective of the amount of disk space needed to create the several copies of each block of data.
Erasure coding (also referred to as noiseless coding), in turn, is based on the use of mathematical algorithms that make it possible to generate n chunks (i.e., data fragments or “derivatives”) of a block of data by encoding the block in such a way that any k of the chunks are sufficient to recover the initial block of data. Each of the n chunks obtained should be written to a separate disk and, preferably, to a separate server to ensure high availability of the data. An erasure-coding scheme with parameters n,k (an “(n,k) scheme”) tolerates the loss of up to n−k chunks, so its reliability is comparable to that of data replication with n−k+1 copies of each data block. Thus, erasure coding makes it possible to reduce considerably the storage overhead that data replication requires: the redundancy of data storage (i.e., the ratio of the volume of stored data to the volume of useful data) for an (n,k) scheme is equal to n/k, because the size of each data chunk is approximately equal to SizeBlock/k, where SizeBlock is the volume of the initial block of data. The most widely used erasure codes in modern data-storage software are Reed-Solomon codes and their variations, such as Cauchy codes.
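The arithmetic behind this comparison can be illustrated with a minimal sketch. The function names below (storage_redundancy, equivalent_replicas, chunk_size) are hypothetical and introduced only for illustration; the sketch computes the quantities named above and does not implement any actual coding algorithm such as Reed-Solomon.

```python
import math


def _check(n: int, k: int) -> None:
    # An (n,k) scheme needs at least one data chunk and no more than n chunks in total.
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")


def storage_redundancy(n: int, k: int) -> float:
    """Ratio of stored volume to useful volume for an (n,k) scheme: n/k."""
    _check(n, k)
    return n / k


def equivalent_replicas(n: int, k: int) -> int:
    """Number of full copies that tolerates the same number of losses (n-k)."""
    _check(n, k)
    return n - k + 1


def chunk_size(block_size: int, k: int) -> int:
    """Approximate chunk size, SizeBlock/k, rounded up to whole bytes."""
    if k <= 0:
        raise ValueError("require k > 0")
    return math.ceil(block_size / k)


if __name__ == "__main__":
    n, k = 6, 4                      # illustrative parameters only
    block = 1_000_000                # SizeBlock in bytes (assumed value)
    print(storage_redundancy(n, k))  # 1.5
    print(equivalent_replicas(n, k)) # 3
    print(chunk_size(block, k))      # 250000
```

Under these assumed parameters, a (6,4) scheme stores 1.5 times the useful data, whereas replication with comparable reliability (3 copies) stores 3 times the useful data.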
Due to the size and complexity of DSSs necessary for large volumes of data, the reliability of storage and accessibility of data in these systems depends not only on the number of replicas or the parameters of the erasure-coding scheme being used, but also on the global scheme of data distribution, which determines to a considerable degree the performance of the storage system. The rapid development and introduction of cloud data technologies has resulted in the emergence of large and super-large platform owners, such as Amazon S3® and EC2®, Microsoft Azure®, Google®, and the like, that have started solving these problems. Today, providers can already have dozens or even hundreds of points of presence at different geographical locations to provide and support DSSs. Reliable and effective data storage requires managing the distribution of data redundancy across geographically distributed servers, which includes optimizing storage traffic and volume as well as handling failures. For a solution to be economically effective, it is necessary not only to distribute replicas across all available servers and to verify that the conditions for sufficient data redundancy are met, but also to use the replicas for data delivery to clients, analogously to a content delivery network (CDN).
As described above, simple replication of data on different servers is costly in terms of disk space. Erasure codes, or (n,k) schemes, reduce this cost, but ensuring appropriate levels of fault tolerance with them, especially in geographically distributed storage systems, still requires semi-manual methods of management (e.g., predetermined schemes of data distribution and replication).
Thus, there is a need for methods and algorithms of data distribution management using erasure codes or the (n,k) scheme.