It is common practice to store information with some level of redundancy to avoid data loss in the event of a hardware failure. This redundancy can be introduced by duplicating (or replicating) the data or by adding a redundancy encoding (e.g., parity blocks or erasure codes). Redundancy data or data replicas may be stored across several failure domains to allow data recovery if a domain fails.
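As an illustration of redundancy encoding, the following minimal sketch uses a single XOR parity block, one of the simplest forms of parity. The function names `xor_parity` and `recover_missing` are hypothetical and chosen only for this example; the sketch assumes equally sized blocks and tolerates the loss of exactly one block.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("blocks must be the same size")
    # zip(*blocks) walks the blocks column by column (one byte from each).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the block that was lost.
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical example data: three data blocks plus one parity block.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)

# Simulate losing the second block and rebuilding it.
recovered = recover_missing([data[0], data[2]], parity)
```

A single parity block of this kind cannot recover from two simultaneous block losses; erasure codes generalize the idea to tolerate more failures.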
Replicas may be stored on different disks, servers, racks, or geographically distant sites, so that the failure of a single disk, server, rack, or even a whole data center does not cause data loss. Built-in redundancy thus makes a system resilient to simultaneous failures, but only up to a certain number of them. For example, a storage system with triple replication is protected against a double disk failure, but not against a triple disk failure.
To provide reliability via replication, once a failure happens, the system uses the remaining replicas to restore the data that was located on the failed hardware until a predetermined redundancy level is reached. During restoration, the system is vulnerable to additional failures, so it is important that the restoration be performed as quickly as possible. Not all data involved in the restoration process is equally vulnerable, however. For example, data with fewer remaining replicas can generally be considered more vulnerable.
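The notion that data with fewer replicas is more vulnerable can be sketched as a simple ordering. The example below is illustrative only and does not describe any particular system's recovery mechanism: the function `recovery_order`, the chunk names, and the use of the surviving replica count as the sole vulnerability measure are all assumptions made for this sketch.

```python
import heapq


def recovery_order(replica_counts, target_replicas):
    """Return (item, surviving_replicas) pairs, most vulnerable first.

    replica_counts maps a hypothetical item identifier to the number of
    replicas that survived the failure. Items already at the target level
    need no recovery; items with zero replicas cannot be restored from
    replicas at all, so both are skipped.
    """
    heap = [
        (count, item)
        for item, count in replica_counts.items()
        if 0 < count < target_replicas
    ]
    heapq.heapify(heap)  # min-heap: fewest surviving replicas at the top

    order = []
    while heap:
        count, item = heapq.heappop(heap)
        order.append((item, count))
    return order


# Hypothetical state after a failure in a triple-replicated system.
surviving = {"chunk-a": 2, "chunk-b": 1, "chunk-c": 3, "chunk-d": 1}
order = recovery_order(surviving, target_replicas=3)
```

Here the two chunks with a single surviving replica are scheduled before the chunk with two, since one more failure would lose them entirely; ties are broken by identifier only to keep the ordering deterministic.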
It would be desirable to enhance the reliability of a storage system with a recovery scheme that dynamically prioritizes the recovery of more vulnerable data. Implementing such a recovery scheme is often not readily possible, because the recovery components in a storage system may be implemented in hardware or may be subject to license terms that prohibit modifications to the system's recovery mechanism. Systems and methods are needed that overcome the above shortcomings.