Erasure coding, commonly implemented with Reed-Solomon codes, is an object-level, parity-based scheme for preventing data loss resulting from storage system failure. In erasure coding, data is partitioned into k data chunks, which are encoded to produce m parity chunks, and both the data chunks and the parity chunks are stored across distributed storage subsystems called failure domains.
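The partition-and-encode step can be illustrated with a minimal sketch. For brevity it assumes the simplest parity scheme, a single XOR parity chunk (m = 1), rather than a full Reed-Solomon code; the function names and the zero-padding convention are hypothetical and chosen only for illustration.

```python
from functools import reduce


def split_into_chunks(data: bytes, k: int) -> list[bytes]:
    """Partition data into k equal-size chunks, zero-padding the tail."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\x00")
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]


def xor_chunks(chunks: list[bytes]) -> bytes:
    """XOR equal-length chunks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def encode(data: bytes, k: int) -> list[bytes]:
    """Return k data chunks followed by one XOR parity chunk (m = 1)."""
    data_chunks = split_into_chunks(data, k)
    return data_chunks + [xor_chunks(data_chunks)]


def reconstruct(stripe: list[bytes], lost_index: int) -> bytes:
    """Recompute one lost chunk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)
```

In this sketch each element of the returned stripe would be placed in a different failure domain, so that the loss of any one domain leaves enough chunks to recompute the missing one. A Reed-Solomon code generalizes the same idea to m parity chunks and tolerates the loss of up to m chunks.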
Despite its advantages, erasure coding exhibits various inefficiencies related to data reconstruction. Erasure-coded systems are configured on commodity hardware that is prone to failure. While the storage capacity of these devices has been increasing rapidly, access speed has not kept pace. As a result, the time required to recover corrupted or lost data has increased significantly.
When a storage failure occurs, the data on the failed storage device is reconstructed from the remaining non-faulted storage subsystems or nodes. Data and parity chunks are read from a fixed set of nodes, and the lost data is computed from them. Quite often, however, one or more of the nodes used for the reconstruction experiences high read-access latency, which extends the reconstruction time even further. Additionally, all reconstruction computations are performed on the input/output (“I/O”) initiator node. Because reconstruction or recovery of a node occurs relatively frequently, these computations can tax the processor of the I/O initiator node.
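The effect of a single slow node on a fixed read set can be sketched as follows. The sketch assumes that chunks are read in parallel, so the read phase lasts as long as the slowest node in the set, and that any k surviving chunks suffice for reconstruction, as is the case for Reed-Solomon codes. The node names, latency figures, and the lowest-latency comparison are hypothetical illustrations, not the selection method of any particular system.

```python
def parallel_read_time(latencies_ms: dict[str, float], read_set: list[str]) -> float:
    """With parallel reads, the read phase waits for the slowest node in the set."""
    return max(latencies_ms[node] for node in read_set)


def fixed_read_set(survivors: list[str], k: int) -> list[str]:
    """Traditional behavior: always read from the first k surviving nodes."""
    return survivors[:k]


def lowest_latency_read_set(
    latencies_ms: dict[str, float], survivors: list[str], k: int
) -> list[str]:
    """Hypothetical alternative: read from the k survivors with the lowest latency."""
    return sorted(survivors, key=lambda node: latencies_ms[node])[:k]


# Hypothetical measurements: node n0 has failed, and n2 is currently slow.
latencies_ms = {"n1": 5.0, "n2": 120.0, "n3": 6.0, "n4": 7.0, "n5": 8.0}
survivors = ["n1", "n2", "n3", "n4", "n5"]
k = 4

fixed_time = parallel_read_time(latencies_ms, fixed_read_set(survivors, k))
chosen_time = parallel_read_time(
    latencies_ms, lowest_latency_read_set(latencies_ms, survivors, k)
)
```

With these assumed figures the fixed set includes the slow node n2, so its read phase takes 120.0 ms, whereas a set that avoids n2 completes in 8.0 ms.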
In view of these deficiencies in traditional erasure-coding systems, the instant disclosure identifies and addresses a need for systems and methods for selecting a set of storage nodes from a plurality of storage nodes for use in reconstructing data on a faulted node in an erasure-coded system.