A fault-tolerant, or “recoverable”, storage system is one that permits recovery of original data even in the event of partial system failures. A system can achieve recoverability by any of several means. One such method is replication, i.e., keeping multiple copies of data. Replication is the primary recovery method used in RAID (“Redundant Array of Independent Disks”) systems. Alternatively, a system can use an error correction code (“ECC”) with sufficient redundancy to achieve recoverability. In general, error correction codes, of which erasure codes are a subset, are data representations that allow errors to be detected and corrected, provided the errors are of a kind the code is designed to handle. Replication and error correction coding both rely on redundancy to ensure fault tolerance. The use of one or the other, or both, has been a design option for fault-tolerant storage systems since the earliest days of RAID.
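The contrast between replication and coding can be illustrated with the simplest erasure code, a single XOR parity block. The sketch below is illustrative only and is not drawn from any particular RAID or ECC implementation; the function names `xor_blocks`, `encode`, and `recover` are hypothetical, and all blocks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one parity block.

    This is a single-erasure code: any one missing block can be rebuilt.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return xor_blocks(survivors)


data = [b"abcd", b"efgh", b"ijkl"]
stored = encode(data)            # 4 blocks stored for 3 blocks of data
rebuilt = recover(stored, 1)     # pretend block 1 was lost
assert rebuilt == b"efgh"
```

Under these assumptions, keeping two full copies would store six blocks to survive one loss, while the parity scheme stores four. The trade-off is that recovery must read every surviving block rather than a single replica.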
A distributed hash table (“DHT”) stores (key, value) pairs in a distributed system consisting of a set of nodes. Each node is responsible for a unique subset of keys, and all the nodes together are responsible for all possible keys. For example, if the keys are numbers in the range [0,1), then each node could be responsible for a connected subrange of numeric keys. Each node knows its neighboring nodes (i.e., it can communicate with its neighbors directly), and the DHT typically, although not necessarily, takes the form of a ring of nodes. A node can also be aware of other, non-neighboring nodes, which increases connectivity and decreases the communication distance (number of hops) between nodes. A DHT lookup can find the node responsible for a given key by starting at any node. If the node is not itself responsible for the key, it forwards the query to the node it knows of whose keys are closest to the desired key. This “greedy algorithm” converges quickly (generally in a number of hops that is logarithmic in the number of nodes, or better) on the node responsible for the desired key.
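The greedy lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any specific DHT: keys lie on the unit ring [0,1), each node owns the subrange from its own `start` up to its successor's `start`, the ring has at least two nodes, and the shortcut links at power-of-two offsets are one hypothetical choice of non-neighboring nodes. The names `Node`, `clockwise`, `lookup`, and `build_ring` are all invented for this example.

```python
from dataclasses import dataclass, field


def clockwise(a, b):
    """Distance travelled going clockwise from a to b on the ring [0, 1)."""
    return (b - a) % 1.0


@dataclass(eq=False)
class Node:
    start: float  # this node owns keys in [start, successor.start)
    # Nodes this node can contact directly; known[0] is always the successor.
    known: list = field(default_factory=list, repr=False)

    def owns(self, key):
        span = clockwise(self.start, self.known[0].start)
        return clockwise(self.start, key) < span


def lookup(node, key):
    """Greedy routing: hop to the known node closest behind the key."""
    hops = 0
    while not node.owns(key):
        # The successor is always a candidate and is strictly closer to the
        # key than the current node, so every hop makes progress.
        node = min(node.known, key=lambda n: clockwise(n.start, key))
        hops += 1
    return node, hops


def build_ring(n):
    """Build n evenly spaced nodes with neighbor and shortcut links (n >= 2)."""
    nodes = [Node(i / n) for i in range(n)]
    shortcuts = [2 ** k for k in range(1, n.bit_length()) if 2 ** k < n]
    for i, node in enumerate(nodes):
        # Successor first, then predecessor, then the assumed shortcuts.
        node.known = [nodes[(i + off) % n] for off in [1, -1] + shortcuts]
    return nodes


nodes = build_ring(8)
owner, hops = lookup(nodes[0], 0.70)
# owner is the node starting at 0.625, reached in 2 hops
```

With only successor and predecessor links the same lookup would walk the ring one node at a time; the shortcut links are what bring the hop count down toward the logarithmic behavior mentioned above.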
In existing storage systems that employ error correction code redundancy schemes, responsibility for storing data and responsibility for maintaining the stored data both reside in a single component. Such systems do not employ distributed hash tables and do not separate responsibility for the storage from the actual maintenance of the storage. As a result, these systems have single points of failure and cannot reconstruct a failed drive in less time than it takes to rewrite the entire drive.