In a storage system with a plurality of storage units, data is often stored in a redundant manner. When one or more of the storage units fail and the associated data is lost, data redundancy allows the data of the failed storage units to be recovered from the operational storage units (assuming there is sufficient redundancy). While it is certainly beneficial that data on a failed storage unit can be recovered, there are certain costs (and concerns) associated with the data recovery process.
First, data recovery consumes resources of the storage system that would otherwise be available to process read and/or write requests of a host. For example, data recovery in most cases involves reading content from the operational storage units in order to recover the lost data. In many cases, once the content is read (e.g., in the form of data blocks and parity blocks), it must be further processed in order to reconstruct the lost data. Such reads and processing of a data recovery process may increase the time it takes for a storage system to respond to read and write requests from a host.
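The read-and-reconstruct step described above can be illustrated with a minimal sketch. It assumes a single XOR parity block per stripe (as in a RAID-5-style layout), which is only one of many possible redundancy schemes; the block contents and the helper name `xor_blocks` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data units plus one parity unit
# (illustrative contents only).
data_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity_block = xor_blocks(data_blocks)

# Assume the unit holding data_blocks[1] fails. Recovering its block requires
# reading every surviving block of the stripe, then XOR-ing them together.
surviving = [data_blocks[0], data_blocks[2], parity_block]
recovered = xor_blocks(surviving)
```

Note that recovering a single block requires a read from every operational storage unit in the stripe, followed by computation, which is why recovery competes with host read and write requests.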
Second, the longer the data recovery process takes, the longer the storage system operates in a degraded mode of operation. In a degraded mode, any data requested from the failed storage unit must first be reconstructed (if it has not already been reconstructed) before the request can be fulfilled, increasing the storage system's response time to read requests. Further, the reduced level of data redundancy in a degraded mode makes the storage system more vulnerable to permanent data loss.
One way to address such concerns is to shorten the data recovery process, and one way to shorten the data recovery process is to reduce the amount of data that needs to be recovered. Such an approach, of course, is not always possible. Indeed, if all the data of a storage unit is lost and that data is needed, there is no choice but to reconstruct all the data of the storage unit, in a process known as a “full rebuild” or a “full reconstruction”. In other cases, however, rebuilding only a subset of the data may be sufficient.
For example, when a storage unit fails, sometimes its data is not lost. In other words, a failure of a storage unit may render the storage unit unresponsive to any read or write requests while leaving its data intact. The problem, upon recovery of the failed storage unit, is that any writes to the storage system that occurred during the failure are not reflected on that storage unit, rendering some of its data “stale”. In this scenario, it is possible to perform a partial rebuild (rather than a full rebuild) on the failed storage unit, reconstructing only the data that is needed to replace the stale data.
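The bookkeeping behind a partial rebuild can be sketched as follows. This is a simplified illustration rather than any particular system's method: it assumes writes are tracked at stripe granularity, and the class `StaleRegionTracker` and its method names are hypothetical.

```python
class StaleRegionTracker:
    """Records which stripes were written while a storage unit was unavailable."""

    def __init__(self):
        self.unit_offline = False
        self.stale_stripes = set()

    def mark_offline(self):
        self.unit_offline = True

    def record_write(self, stripe):
        # Only writes made during the failure leave stale data on the failed unit.
        if self.unit_offline:
            self.stale_stripes.add(stripe)

    def stripes_to_rebuild(self, total_stripes, data_intact):
        # If the unit's data was lost, every stripe needs a full rebuild;
        # otherwise only the stripes written during the outage do.
        if not data_intact:
            return list(range(total_stripes))
        return sorted(self.stale_stripes)


tracker = StaleRegionTracker()
tracker.record_write(3)   # unit still online: nothing becomes stale
tracker.mark_offline()
tracker.record_write(7)
tracker.record_write(2)
tracker.record_write(7)   # repeated writes to a stripe are tracked once

partial = tracker.stripes_to_rebuild(total_stripes=10, data_intact=True)
full = tracker.stripes_to_rebuild(total_stripes=10, data_intact=False)
```

In this sketch the partial rebuild touches two of ten stripes, whereas the full rebuild touches all ten. The set of stale stripes is itself state that the storage system must maintain for the duration of the failure.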
While a partial rebuild is preferable to a full rebuild (reducing the amount of time that the system is in a degraded mode of operation and reducing the processing load on the storage system), a tradeoff is that the storage system is required to keep track of which data needs to be rebuilt, which takes additional resources as compared to a full rebuild process.