The present invention relates generally to data storage systems, and more particularly to dynamically quantifying and improving the reliability of distributed data storage systems.
Reliable storage of data is a critical operation across a wide spectrum of applications: for example, personnel records, financial transactions, multimedia services, industrial process control, and basic research. Data is stored on physical media, such as semiconductor media (for example, flash memory), optical media (for example, compact discs and digital video discs), and magnetic media (for example, tape and hard drives). For applications requiring high capacity and fast dynamic read/write speeds, magnetic hard drives are currently the most common data storage device. Capacity and read/write speeds of other media continue to increase, however.
For high-capacity data storage systems, multiple data storage devices may be connected together. For example, multiple hard drives may be connected via a local interface to form a data storage unit. Multiple data storage units may then be connected via a data communications network to form a distributed data storage system. Since each device may fail, distributed data storage systems have multiple points of failure. Redundancy is often used to improve reliability, either by replicating the data blocks, as in RAID-1 or replica-based distributed systems, or by storing additional information, as in RAID-5 or erasure-coded distributed systems. Unless the amount of redundancy in the system is extremely large, when a device fails in a large-scale system, the data stored on it has to be immediately reconstructed on other devices, since device repair or replacement may take a long time, and new failures can occur in the interim. Since high redundancy entails the expense of additional devices, however, improving reliability through failure-management policies instead of additional hardware is desirable.
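The difference between replication and parity-based redundancy can be illustrated with a minimal sketch. It assumes single-parity XOR over equal-length blocks, in the manner of RAID-5; the block contents and the helper name xor_blocks are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, each stored on a different device.
data = [b"ABCD", b"EFGH", b"IJKL"]

# A fourth device stores the parity block instead of a full copy of the data.
parity = xor_blocks(data)

# Suppose the device holding data[1] fails. XOR-ing the surviving data
# blocks with the parity block reconstructs the lost block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

In this sketch, three data blocks are protected against a single device failure with one additional block, whereas two-way replication of the same data would require three additional blocks. Reconstruction, however, must read every surviving block in the stripe.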
To improve reliability, a quantitative metric characterizing the reliability of a distributed data storage system first needs to be defined. Existing metrics include Probability of Data Loss (PDL) and Mean Time To Data Loss (MTTDL). PDL is estimated either as the percentage of simulation runs that result in data loss or by using a (typically combinatorial) model of the PDL for the system. Similarly, MTTDL is estimated either as the mean of the time-to-data-loss values over a large number of simulations or by using a (typically Markovian) model of the system reliability. Regardless of how they are computed, however, PDL and MTTDL quantify reliability with a single, static measure, irrespective of time or the current state of the system. Although useful in some applications, these metrics provide only a macroscopic, long-term view of system reliability. They are not capable of assessing reliability at each point in time, as device failures, data reconstructions, and device replacements occur.
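The simulation-based and model-based estimates described above can be sketched for a deliberately small case. The example assumes a single mirrored pair with exponentially distributed failure and rebuild times; the rates, the mission time, and all function names are hypothetical and are not taken from any particular system.

```python
import random

def time_to_data_loss(fail_rate, repair_rate, rng):
    """Simulate one mirrored pair until both copies are lost; return the elapsed time."""
    t = 0.0
    while True:
        # Both devices healthy: wait for the first of two independent failures.
        t += rng.expovariate(2 * fail_rate)
        # One device down: the rebuild races against a failure of the survivor.
        rebuild = rng.expovariate(repair_rate)
        second_failure = rng.expovariate(fail_rate)
        if second_failure < rebuild:
            return t + second_failure  # data loss
        t += rebuild  # rebuild finished first; both devices healthy again

def estimate_pdl_and_mttdl(fail_rate, repair_rate, mission_time, runs, seed=0):
    """Estimate PDL over mission_time and MTTDL from repeated simulation runs."""
    rng = random.Random(seed)
    samples = [time_to_data_loss(fail_rate, repair_rate, rng) for _ in range(runs)]
    pdl = sum(1 for s in samples if s <= mission_time) / runs
    mttdl = sum(samples) / runs
    return pdl, mttdl

def markov_mttdl(fail_rate, repair_rate):
    """Closed-form MTTDL of the same two-state Markov model of a mirrored pair."""
    return (3 * fail_rate + repair_rate) / (2 * fail_rate ** 2)

# Illustrative rates in arbitrary time units.
pdl, mttdl = estimate_pdl_and_mttdl(fail_rate=1.0, repair_rate=10.0,
                                    mission_time=5.0, runs=20000)
model = markov_mttdl(1.0, 10.0)  # 6.5 for these rates
```

Both outputs are single numbers computed for the system as a whole. Neither reflects how many devices are failed or how far a rebuild has progressed at a given moment, which is the limitation noted above.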
What is needed is a method and apparatus for dynamically quantifying the reliability of a distributed data storage system and for improving that reliability without additional device redundancy.