Data-intensive applications require extreme scaling of their underlying storage systems. Such scaling, together with the fact that many storage systems are implemented in physical data centers, increases the risk of data loss from failures of underlying storage components. Accurate engineering requires quantitative prediction of reliability, but this remains challenging because it must account for extreme scale, redundancy scheme type and strength, and distribution architecture, as well as component dependencies and failure and repair rates.
While no storage system can be completely invulnerable to loss, well-known design techniques, including replication and erasure codes, can reduce the probability of loss over a fixed period to a level small enough to be compensated for by other techniques, such as archival or cold storage, recomputation, or insurance. However, engineering this approach requires a method of quantitatively predicting the reliability of a given large storage system, both theoretically and as it is implemented in physical data centers. This knowledge would help in determining which erasure code strength to deploy, a decision that can have significant economic consequences.
A key characteristic of large storage systems is dependency, which has been known to be a dominant factor in their reliability. Data can be lost if a drive fails, but data can also be effectively lost if a host, rack, or data center necessary for accessing the drive goes down. Moreover, when a rack goes down, it may not only make many drives inaccessible at once; the sudden loss of power often causes permanent damage as well. Some storage system components may obey complex stateful failure models, such as Weibull distributions, while others may be well modeled by simple stateless failure models. Thus, modeling detailed hardware dependencies is important to calculating the reliability of the storage system.
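The distinction between stateful and stateless failure models can be illustrated with a minimal sketch. It assumes that "stateless" corresponds to an exponential lifetime (constant failure rate) and that "stateful" corresponds to a Weibull lifetime whose failure rate depends on component age. The function names and the parameter values (rate, shape, scale) are hypothetical and chosen only for illustration; they are not measured values.

```python
import math


def exponential_survival(t, rate):
    """P(component still working at time t) under a constant failure rate."""
    return math.exp(-rate * t)


def weibull_survival(t, shape, scale):
    """P(component still working at time t) under a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(survival, age, horizon):
    """P(survive another `horizon` | already survived to `age`)."""
    return survival(age + horizon) / survival(age)


# Illustrative parameters only (time unit: years); assumptions, not data.
def exp_model(t):
    return exponential_survival(t, rate=0.05)


def weibull_model(t):
    return weibull_survival(t, shape=2.0, scale=10.0)


# Compare the chance of surviving one more year for a new and an aged part.
for age in (0.0, 5.0):
    print(age,
          conditional_survival(exp_model, age, 1.0),
          conditional_survival(weibull_model, age, 1.0))
```

Under these assumptions, the exponential model gives the same one-year survival probability regardless of age, so no per-component state is needed, whereas the Weibull model with shape greater than 1 gives a lower probability for the older component, so the component's age must be tracked.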
In addition to the impact dependency has on the reliability of hardware elements, dependencies can also affect the system at the software level. For example, in a partitioned placement scheme, such as one consisting of two separate RAID arrays, the partitions will fail independently at the software level, and so the aggregate reliability is straightforward to compute (it is inversely proportional to capacity). By contrast, a spread-placed redundant storage system, such as a Hadoop Distributed File System (HDFS) or a Quantcast File System (QFS), places file shares randomly across all the drives, with different files placed on different sets of drives. While each file is individually a k-out-of-n system, the aggregate reliability of all files may be much lower than inversely proportional to capacity. As described herein, a k-out-of-n system refers to one in which information is represented redundantly across n underlying components, such that the loss of any k or fewer of the n components can be tolerated by applying a known repair algorithm to the information in the surviving n-k components.
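As a rough illustration of the k-out-of-n definition above, the following sketch computes the probability that a single k-out-of-n system loses data, and the probability that at least one of several independent partitions loses data. It assumes that each component fails independently with the same probability p during the period of interest, which is precisely the simplification that hardware and placement dependencies violate in practice. The function names and the example values are hypothetical.

```python
from math import comb


def file_loss_probability(n, k, p):
    """P(more than k of n components fail), assuming independent failures.

    Follows the definition used here: losing any k or fewer of the
    n components is tolerated, so data is lost only when k + 1 or more fail.
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k + 1, n + 1))


def any_partition_loss_probability(q, partitions):
    """P(at least one of `partitions` independent partitions loses data)."""
    return 1 - (1 - q) ** partitions


# Illustrative values only: 3 components, 1 tolerated loss, p = 0.1.
q = file_loss_probability(n=3, k=1, p=0.1)
print(q)
print(any_partition_loss_probability(q, partitions=2))
```

For small per-partition loss probabilities, the second function grows roughly in proportion to the number of partitions, which is one way to read the statement that partitioned reliability scales inversely with capacity. The same product formula does not apply to spread placement, where files share drives and their failures are therefore not independent.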