The present disclosure relates to distributed storage systems and, more specifically, to methods, systems, and computer program products for evaluating the durability and availability of distributed storage systems.
Data reliability is crucial for distributed storage systems. Distributed storage systems typically use replication and erasure coding schemes to increase their resiliency to failures. Replication stores replicas (copies) of data across different failure domains. Erasure coding divides data into data chunks and parity chunks, and distributes them across different failure domains. The different failure domains can be defined by different storage devices, different servers, different racks, and even different data centers. In a distributed storage system, all the components are connected by a network and can be accessed from one another.
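By way of illustration only, the trade-off between the two schemes can be sketched as follows. The sketch assumes one chunk per failure domain and, for erasure coding, a maximum distance separable code in which any m chunks may be lost without data loss. The names `Scheme`, `replication`, and `erasure_code` are hypothetical and are not part of any particular storage system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A redundancy scheme described by its chunk counts (illustrative only)."""
    name: str
    data_chunks: int       # chunks holding original data
    redundant_chunks: int  # extra replicas or parity chunks

    @property
    def failures_tolerated(self) -> int:
        # Assumption: one chunk per failure domain and an MDS code, so any
        # `redundant_chunks` failure domains can be lost without data loss.
        return self.redundant_chunks

    @property
    def storage_overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return (self.data_chunks + self.redundant_chunks) / self.data_chunks


def replication(copies: int) -> Scheme:
    # One data chunk plus (copies - 1) additional full replicas.
    return Scheme(f"{copies}-way replication", 1, copies - 1)


def erasure_code(k: int, m: int) -> Scheme:
    # k data chunks plus m parity chunks.
    return Scheme(f"EC({k}+{m})", k, m)


if __name__ == "__main__":
    for scheme in (replication(3), erasure_code(6, 3)):
        print(scheme.name, scheme.failures_tolerated, scheme.storage_overhead)
```

Under these assumptions, three-way replication tolerates the loss of two failure domains at three times the raw capacity, whereas a 6+3 erasure code tolerates the loss of three failure domains at one and a half times the raw capacity.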
Durability and availability are two important metrics that are commonly used for measuring and comparing the overall reliability of distributed storage systems in general and of cloud storage in particular. As used herein, the availability of a distributed storage system is the fraction of time that the data is accessible through the system. As used herein, the durability of a distributed storage system is the percentage of the data that remains intact after a predetermined time period. For example, if after a year of use 0.01 percent of the data stored in the distributed storage system has been lost and is not recoverable, the durability of the distributed storage system is determined to be 99.99% (100 − 0.01).
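As a minimal sketch of the two definitions above, the metrics can be computed from observed totals as follows. The function names and the choice of bytes and seconds as units are assumptions made for illustration.

```python
def durability_percent(bytes_stored: float, bytes_lost: float) -> float:
    """Percentage of stored data that remains intact after the period."""
    if bytes_stored <= 0:
        raise ValueError("bytes_stored must be positive")
    return 100.0 * (1.0 - bytes_lost / bytes_stored)


def availability_fraction(period_seconds: float, unavailable_seconds: float) -> float:
    """Fraction of the period during which the data was accessible."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 1.0 - unavailable_seconds / period_seconds


if __name__ == "__main__":
    # 0.01 percent of the stored data lost over the period, as in the example above.
    print(durability_percent(bytes_stored=1_000_000, bytes_lost=100))
    # One hour of inaccessibility over a 365-day year (hypothetical figures).
    print(availability_fraction(365 * 24 * 3600, 3600))
```

With 100 of 1,000,000 bytes lost, the first call reproduces the 99.99% durability of the example, up to floating-point rounding.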
In general, it is not practical to measure the availability and durability of a distributed storage system using a short-running benchmark test or based on a scaled-down system, because both availability and durability are directly influenced by scale and by low-probability events (failures) that occur over time. Nevertheless, estimates of the expected availability and durability of a distributed storage system are critical when designing, deploying, and operating a distributed storage system.
One common approach to estimating the durability and availability of a distributed storage system is to use analytic models that rely on simplistic and unrealistic assumptions about the distributed storage system, such as independent exponential distributions for failure and repair times. Using these assumptions, Markov models can be constructed to obtain closed-form equations for evaluating durability and availability. However, these models do not take into account various characteristics of the distributed storage system, such as the realistic distributions of disk and server failures and the influence of network bandwidth and disk repair bandwidth on disk recovery time, which increases with the number of simultaneous failures. The latter has a large impact on the likelihood of additional failures causing data loss and thus on durability and availability. Additional characteristics not captured by these models include system configuration and scale.
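For illustration, a textbook instance of such an analytic model is a three-state Markov chain for data stored on two replicas, with both replicas healthy, one replica failed, or data lost. Assuming independent exponential failures at rate λ per replica and exponential repair at rate μ, the mean time to data loss has the closed form (3λ + μ) / (2λ²). The sketch below evaluates this expression. The function name and the example figures are hypothetical, and the model embodies exactly the simplifying assumptions criticized above.

```python
def mttdl_two_replicas(failure_rate: float, repair_rate: float) -> float:
    """Mean time to data loss for two replicas under a simple Markov model.

    Assumptions (illustrative only): independent exponential failures at
    `failure_rate` per replica, exponential repair at `repair_rate`, and a
    repair time that does not depend on load or on concurrent failures.
    """
    if failure_rate <= 0 or repair_rate < 0:
        raise ValueError("failure_rate must be > 0 and repair_rate >= 0")
    lam, mu = failure_rate, repair_rate
    # Derived from the expected-time equations of the chain:
    #   T2 = 1/(2*lam) + T1
    #   T1 = 1/(lam + mu) + (mu/(lam + mu)) * T2
    return (3.0 * lam + mu) / (2.0 * lam ** 2)


if __name__ == "__main__":
    # Hypothetical figures: mean time to failure of 1,000,000 hours per disk
    # and a mean repair time of 24 hours.
    lam = 1.0 / 1_000_000
    mu = 1.0 / 24
    print(f"MTTDL: {mttdl_two_replicas(lam, mu):.3e} hours")
```

Because the repair rate is a fixed constant in this model, the result cannot reflect the lengthening of recovery time when several failures share the same network and disk bandwidth.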
Another approach to estimating the durability and availability of a distributed storage system is to use simulation. However, existing simulation methods do not appropriately model the network portion of the distributed storage system or the influence of network bandwidth, disk bandwidth, and simultaneous failures on disk recovery time. This is despite the fact that these factors can have a large impact on the probability of data loss and data unavailability, and thus on durability and availability.
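To illustrate why these factors matter, the following sketch estimates disk recovery time under a deliberately simple bandwidth-sharing assumption: each repair is limited by the slower of the disk repair bandwidth and an equal share of a common network link. This sharing rule, the function name, and the example figures are assumptions made for illustration and do not describe any existing simulator.

```python
def repair_time_hours(
    data_tb: float,
    disk_bw_mb_s: float,
    network_bw_mb_s: float,
    concurrent_repairs: int,
) -> float:
    """Estimated time to rebuild one disk, in hours.

    Assumption: the network link is split equally among all concurrent
    repairs, and each repair runs at the slower of its network share and
    the disk repair bandwidth.
    """
    if concurrent_repairs < 1:
        raise ValueError("concurrent_repairs must be at least 1")
    network_share = network_bw_mb_s / concurrent_repairs
    effective_mb_s = min(disk_bw_mb_s, network_share)
    data_mb = data_tb * 1_000_000  # 1 TB = 10^6 MB (decimal units)
    return data_mb / effective_mb_s / 3600.0


if __name__ == "__main__":
    # Hypothetical figures: 4 TB disks, 100 MB/s disk repair bandwidth,
    # and a 1000 MB/s shared network link.
    for n in (1, 10, 20):
        print(n, round(repair_time_hours(4, 100, 1000, n), 2))
```

Under these assumptions, recovery time is unchanged until the shared link becomes the bottleneck and then grows in proportion to the number of simultaneous repairs, which lengthens the window in which an additional failure can cause data loss.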