Distributed data-storage systems (“DSSs”) are complex software systems that operate over hardware infrastructures consisting of a large number of servers of various roles connected by multiple communication channels. Existing DSSs are subject to various kinds of hardware failures, including total or partial loss of electrical power, network failures that may divide the DSS into separate unconnected segments, disk failures, and the like.
Conventionally, there are two main approaches to ensuring reliable data storage in the presence of failures, both based on duplicating information and spreading data over different components of the distributed data-storage system: the first approach is data replication and the second approach is erasure coding.
In general, data replication is the storage of each block of data (i.e., file or object, depending on the architecture of the data-storage system) in several copies on different disks or different nodes of the system. Replication thereby ensures maximum efficiency of data access, including a significant increase in the speed of read access to data frequently used by different clients. However, data replication can be very costly in terms of the disk space needed to store the several copies of each block of data.
Erasure (sometimes called noiseless) coding, in contrast, is based on mathematical algorithms that generate n chunks (i.e., data fragments) of a block of data in such a way that any k of the chunks are sufficient to recover the initial block of data. Each of the n chunks obtained should be written to a separate disk, and preferably to a separate server, to ensure high availability of the data. The reliability of an erasure-coding scheme with parameters n and k (an “(n,k) scheme”) is comparable to the reliability of data replication with n−k+1 copies of each data block. Thus, the use of erasure coding makes it possible to reduce considerably the storage overhead required by data replication: the redundancy of data storage (i.e., the ratio of the volume of stored data to the volume of useful data) for an (n,k) scheme is equal to n/k, since the size of a data chunk is approximately equal to SizeBlock/k, where SizeBlock is the size of the initial block of data. The most widely used error-correcting codes in modern data-storage software are Reed-Solomon codes and their variations, such as Cauchy Reed-Solomon codes.
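As an illustration of the recovery property, the following sketch implements the simplest possible erasure code, an (n,k) = (3,2) scheme with a single XOR parity chunk, rather than the Reed-Solomon construction itself: any 2 of the 3 chunks suffice to rebuild the block, and the redundancy is n/k = 1.5.

```python
# Illustrative (n, k) = (3, 2) erasure code: two data chunks plus one XOR
# parity chunk. Any k = 2 surviving chunks recover the block; redundancy n/k = 1.5.

def encode(block: bytes) -> list:
    """Split a block into k = 2 data chunks and append one XOR parity chunk."""
    half = (len(block) + 1) // 2
    d0 = block[:half]
    d1 = block[half:].ljust(half, b"\0")      # pad so both chunks are equal length
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]

def decode(chunks: list, size: int) -> bytes:
    """Recover the original block from any 2 surviving chunks (None = lost disk)."""
    d0, d1, p = chunks
    if d0 is None:                            # rebuild d0 from d1 and parity
        d0 = bytes(a ^ b for a, b in zip(d1, p))
    if d1 is None:                            # rebuild d1 from d0 and parity
        d1 = bytes(a ^ b for a, b in zip(d0, p))
    return (d0 + d1)[:size]

block = b"hello, erasure world"
chunks = encode(block)
chunks[1] = None                              # simulate the loss of one disk
assert decode(chunks, len(block)) == block
```

A real (n,k) scheme such as Reed-Solomon generalizes this idea with finite-field arithmetic so that any k of n chunks suffice for arbitrary n and k.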
Due to the size and complexity of DSSs needed for large volumes of data, the reliability of storage and the accessibility of data in these systems depend not only on the number of replicas or on the parameters of the erasure-coding scheme being used, but also on the global scheme of data distribution, which to a considerable degree determines the performance of the storage system.
Currently, the most widespread method of data distribution in modern DSSs is randomized distribution of data chunks over the disks or servers of the system. The popularity of the randomized distribution method rests on two factors: (1) the simplicity of its implementation, which does not require the concrete hardware or network topology of the cluster to be taken into account, and (2) the fact that a random distribution of the data ensures a sufficiently uniform load over the various nodes of the cluster without complicated heuristic load-balancing algorithms. Furthermore, a random distribution of data over the disks in the cluster can significantly speed up recovery, because data that had been stored on different sets of disks can be recovered in parallel. Random distribution of data and its variants are used in such distributed systems as HDFS (“Hadoop® Distributed File System”), GFS (“Google® File System”), and the like.
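A minimal sketch of the randomized placement strategy described above (the disk and block counts are illustrative): each block's n chunks are assigned to n distinct disks chosen uniformly at random, which keeps the per-disk load close to uniform without any explicit balancing logic.

```python
import random
from collections import Counter

def random_placement(num_disks: int, n: int, num_blocks: int, seed: int = 0) -> list:
    """For each block, pick n distinct disks uniformly at random for its n chunks."""
    rng = random.Random(seed)
    return [rng.sample(range(num_disks), n) for _ in range(num_blocks)]

placements = random_placement(num_disks=100, n=6, num_blocks=10_000)
load = Counter(disk for p in placements for disk in p)
# 60,000 chunks over 100 disks gives an expected load of 600 chunks per disk;
# random placement keeps every disk close to that average with no balancing code
```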
Nevertheless, when a random scheme of data distribution is used, loss of data is virtually inevitable when several disks fail in a correlated fashion. For a distribution strategy that chooses disks completely at random for the chunks of a block coded by means of an (n,k) scheme, the probability of losing data in a cluster storing a sufficiently large number of data blocks when more than n−k disks crash simultaneously grows with the number of disks in the cluster and is close to 1 for a storage cluster with hundreds of disks.
This problem arises because there is a large number of possible arrangements of the chunks of a block, and specifically a large number of distinct sets of n disks, each of which holds the chunks of one or several blocks. The more such disk sets exist, the higher the probability that, when n−k+1 disks crash at the same time, all of the failed disks belong to the set of disks that stores some block of data.
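This effect can be made concrete with a short sketch (the parameter values are illustrative): under random placement, every (n−k+1)-subset of some block's disk set is a "vulnerable" combination whose joint failure destroys that block, and the probability that a random simultaneous crash of n−k+1 disks loses data is the fraction of all (n−k+1)-disk combinations that are vulnerable. That fraction grows quickly with the number of stored blocks.

```python
import random
from itertools import combinations
from math import comb

def data_loss_probability(num_disks: int, n: int, k: int, num_blocks: int,
                          seed: int = 1) -> float:
    """Probability that a simultaneous crash of exactly n - k + 1 disks destroys
    at least one block, given random placement of each block's n chunks."""
    rng = random.Random(seed)
    vulnerable = set()  # (n-k+1)-disk subsets whose joint failure loses some block
    for _ in range(num_blocks):
        disks = sorted(rng.sample(range(num_disks), n))
        vulnerable.update(combinations(disks, n - k + 1))
    return len(vulnerable) / comb(num_disks, n - k + 1)

# the more blocks are stored, the more disk combinations become fatal
p_small = data_loss_probability(num_disks=100, n=6, k=4, num_blocks=1_000)
p_large = data_loss_probability(num_disks=100, n=6, k=4, num_blocks=100_000)
```

With 100 disks and a (6,4) scheme, each stored block makes 20 of the C(100,3) = 161,700 possible 3-disk crashes fatal, so a cluster holding on the order of 100,000 blocks is almost certain to lose data when any 3 disks fail together.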
On the other hand, grouping the disks into nonintersecting sets of n elements (e.g., in the manner of independent RAID 6 arrays) significantly increases the reliability of storage, but does not solve the problem completely: the time needed to recover the data increases, because fewer disks remain from which data can be read in parallel during recovery, and the efficiency of read access to the data is reduced as a whole.
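For contrast, a sketch of the grouping strategy (parameters again illustrative): when the disks are partitioned into disjoint groups of n and every block is confined to one group, only num_disks/n distinct n-disk sets can ever appear, so the number of fatal disk combinations stays bounded no matter how many blocks are stored; the price is that each recovery can read from at most the n−1 surviving disks of a single group.

```python
import random

def grouped_placement(num_disks: int, n: int, num_blocks: int, seed: int = 2) -> list:
    """Partition disks into disjoint groups of n and place each block's chunks
    entirely within one randomly chosen group (RAID-like grouping)."""
    rng = random.Random(seed)
    groups = [list(range(g * n, (g + 1) * n)) for g in range(num_disks // n)]
    return [rng.choice(groups) for _ in range(num_blocks)]

placements = grouped_placement(num_disks=96, n=6, num_blocks=10_000)
distinct_disk_sets = {tuple(p) for p in placements}
# at most 96 / 6 = 16 distinct disk sets can ever appear, however many blocks
# are stored, versus the rapidly growing number under random placement
```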
Furthermore, it should be appreciated that increasing the number n−k of parity chunks also does not solve the problem for a sufficiently large number of disks (i.e., the probability of data loss remains high), and high overheads and a significant reduction in performance can be expected. In addition, the parameter n of the (n,k) scheme must then be increased to provide the same redundancy level, but increasing the number of chunks in the (n,k) coding scheme increases read-access latency, since the latency of reading a block is the maximum latency over all of its chunk reads.
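The latency effect can be sketched as follows (exponentially distributed chunk-read times are an illustrative assumption, not a model stated above): because a block read completes only when its slowest chunk read completes, the expected latency is the expected maximum of n draws, which grows with n.

```python
import random

def block_read_latency(n: int, rng: random.Random) -> float:
    """A block read finishes only when the slowest of its n chunk reads finishes."""
    return max(rng.expovariate(1.0) for _ in range(n))

def mean_latency(n: int, trials: int = 20_000, seed: int = 3) -> float:
    """Average block-read latency over many simulated reads."""
    rng = random.Random(seed)
    return sum(block_read_latency(n, rng) for _ in range(trials)) / trials

# for i.i.d. exponential chunk reads, the expected maximum of n reads is the
# harmonic number H_n = 1 + 1/2 + ... + 1/n, so latency rises as n grows
```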