There are many different types of storage devices, including standalone and distributed storage devices, each of which can be broadly referred to as a storage unit. For instance, a standalone storage device can be referred to as a storage unit. Likewise, a distributed set of storage devices that are logically arranged for utilization by one or more entities can also be referred to as a storage unit.
A storage unit can include any combination of one or more drive(s), array(s), rack(s), or other storage device(s), including mechanical hard disk drives (HDDs) with moving parts and solid-state drives (SSDs) with virtually no moving parts.
Various types of data can be contained within a storage unit, including electronic files, images and other data. Stored data can be referred to broadly as a data set, whether it includes multiple files or is limited to only a single file.
Sometimes, data sets become corrupted or otherwise inaccessible due to software and/or hardware failures. The durability of the data sets and overall resiliency to storage failures can be significantly improved by replicating the data sets in different storage locations.
Mirroring operations, for instance, make one or more redundant copies or replicas of a data set. These copies can be stored within a single storage unit or within multiple separate storage units. When a failure affects one or more of the copies, a new copy must be made before the last copy is lost; otherwise, the failure will be relatively catastrophic and the data will be irrecoverable. While mirroring is very robust, it can also be computationally expensive to implement.
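The mirroring repair rule described above can be sketched as follows. This is a minimal illustration, not a description of any particular embodiment; the `REPLICA_TARGET` value, the `repair` function name, and the in-memory list of replicas are all hypothetical simplifications.

```python
# Hypothetical sketch of a mirroring repair rule: whenever a failure
# reduces the number of healthy replicas, a new copy is made from a
# surviving replica before the last copy can be lost.

REPLICA_TARGET = 3  # illustrative replica count

def repair(replicas: list[bytes]) -> list[bytes]:
    """Re-replicate from any surviving copy up to the target count."""
    if not replicas:
        # No surviving copy remains: the failure is catastrophic.
        raise RuntimeError("all copies lost: data set is irrecoverable")
    while len(replicas) < REPLICA_TARGET:
        replicas.append(replicas[0])  # copy from a surviving replica
    return replicas

copies = [b"data set"] * REPLICA_TARGET
copies.pop()             # a failure destroys one copy
copies = repair(copies)  # recovery restores the target replica count
assert len(copies) == REPLICA_TARGET
```

The computational expense of mirroring noted above follows directly from this rule: each repair copies the entire data set, rather than a smaller parity symbol.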
RAID (redundant array of independent disks) technologies can also be implemented to protect against storage failures, without the same computational expense as mirroring techniques. With a RAID system, data is split or striped into different symbols that are stored with parity data in different storage locations. When one symbol becomes inaccessible due to a failure, the remaining symbols can be used to recover the lost symbol through XOR'ing processes utilizing the parity data.
RAID systems can be configured with single parity configurations and multiple parity configurations. For single parity configurations, the stored data and corresponding parity data is typically split into three symbols (e.g., D1 (partial data), D2 (partial data) and P (parity information)). A data set that is stored in the foregoing manner can survive a failure of any single symbol and still be reconstructed. For instance, a failure resulting in the loss of either D1, D2 or P is recoverable by utilizing the two remaining symbols to reconstruct the lost symbol through XOR'ing reconstruction processes. However, two or more concurrent failures that result in the cumulative loss of D1 and D2, or D1 and P, or D2 and P will be relatively catastrophic, such that it will not be possible to perform the XOR'ing reconstruction processes and the lost data will be irrecoverable.
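The single parity scheme described above can be sketched concretely. The symbol names D1, D2, and P follow the description; the byte values and the `xor_bytes` helper are illustrative only.

```python
# Sketch of single-parity striping and XOR reconstruction,
# assuming two data symbols of equal length.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

data = b"ABCDEF"              # original data set
d1, d2 = data[:3], data[3:]   # split (stripe) into two data symbols
p = xor_bytes(d1, d2)         # parity symbol: P = D1 XOR D2

# Any single lost symbol can be rebuilt from the other two:
assert xor_bytes(d2, p) == d1            # recover D1 from D2 and P
assert xor_bytes(d1, p) == d2            # recover D2 from D1 and P
assert xor_bytes(d1, d2) == p            # recover P from D1 and D2
assert xor_bytes(d2, p) + d2 == data     # full data set restored
```

As the text notes, losing any two of the three symbols leaves only one operand for the XOR, so reconstruction is no longer possible.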
A multiple parity configuration, such as LRC (local reconstruction codes), can survive multiple concurrent symbol losses without suffering catastrophic failure, by utilizing layered local and global parity symbols. However, even with a multiple parity configuration, the reconstruction/recovery processes must be completed before too many symbols are lost (e.g., before the data set is reduced to a single symbol), or else it can still become impossible to perform the XOR'ing reconstruction processes.
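The layered local/global parity arrangement can be sketched as follows. The grouping of data symbols, the symbol names, and the byte values are hypothetical simplifications chosen for illustration; real multiple parity codes use more sophisticated constructions.

```python
# Sketch of layered parity, assuming four data symbols in two local
# groups, each with a local parity, plus one global parity over all
# data symbols.

def xor_bytes(*parts: bytes) -> bytes:
    """XOR any number of equal-length byte strings."""
    out = bytes(len(parts[0]))
    for part in parts:
        out = bytes(x ^ y for x, y in zip(out, part))
    return out

d1, d2, d3, d4 = b"AA", b"BB", b"CC", b"DD"  # data symbols
l1 = xor_bytes(d1, d2)           # local parity for group 1
l2 = xor_bytes(d3, d4)           # local parity for group 2
g = xor_bytes(d1, d2, d3, d4)    # global parity over all data

# A single loss in each group is repaired locally:
assert xor_bytes(d2, l1) == d1
assert xor_bytes(d3, l2) == d4

# The global parity can repair a loss even if the group's local
# parity is also lost, by XOR'ing the surviving data symbols:
assert xor_bytes(g, d1, d2, d3) == d4
```

The layered structure is what allows multiple concurrent losses to be survived, so long as recovery completes before too many symbols in any one reconstruction path are gone.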
Mirroring and RAID technologies can be implemented alone and in combination, utilizing one or more shared storage units. However, sharing of storage units can create bandwidth issues. For example, a storage unit (e.g., drive, array, rack, etc.) typically has a limited bandwidth, meaning that it is only able to perform some limited number of I/O operations within a given period of time. The number of I/O operations that can be performed can be further reduced when some or all of the I/O operations are particularly time-consuming. These I/O operations include application operations as well as the restitution and reconstruction recovery operations described above.
When there are contentions for bandwidth amongst multiple entities, it is desirable for a storage system to be able to accommodate all of the entities. However, if one or more of the entities is given higher priority than other entities, the higher priority entities may, in effect, block other entities from accessing a desired storage unit. Likewise, high priority operations can also effectively block lower priority operations from being performed.
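The blocking effect described above can be sketched with a simple priority queue over a fixed I/O budget. The per-tick I/O limit, the operation names, and the numeric priorities are all illustrative assumptions, not parameters of any actual system.

```python
import heapq

# Sketch of a priority contest for limited storage-unit bandwidth:
# the unit services a fixed number of I/O operations per tick, and a
# steady stream of higher-priority application I/O can starve a
# lower-priority recovery operation indefinitely.

IOPS_PER_TICK = 2  # illustrative bandwidth limit

# Entries are (priority, sequence, name); a lower number is serviced
# first, and the sequence breaks ties in arrival order.
queue: list[tuple[int, int, str]] = []
for seq in range(4):
    heapq.heappush(queue, (0, seq, "application-io"))  # high priority
heapq.heappush(queue, (9, 99, "recovery-op"))          # low priority

serviced = [heapq.heappop(queue)[2] for _ in range(IOPS_PER_TICK)]
assert serviced == ["application-io", "application-io"]
assert queue[0][2] == "application-io"  # recovery is still waiting
```

Because new high-priority operations keep arriving ahead of it, the recovery operation at the back of the queue may never be serviced, which is precisely the blocking hazard noted above.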
In some instances, priority contests may result in recovery operations being delayed or blocked until it is too late. For instance, a recovery operation being delayed until a last copy of a data set is lost (in a mirror configuration) or until a critical symbol is lost (in a RAID parity configuration) will be relatively catastrophic. This can be particularly problematic in distributed systems that utilize multiple shared storage units having different storage traits and that are utilized to perform many different I/O operations, with dynamically changing storage conditions.
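One way to avoid such a catastrophic delay is to raise a recovery operation's priority as the data set approaches an irrecoverable state. The sketch below illustrates the idea for a mirror configuration; the thresholds, the function name, and the priority scale are hypothetical.

```python
# Sketch of dynamically escalating recovery priority as the number of
# surviving mirror copies drops toward the irrecoverable point.

def recovery_priority(surviving_copies: int) -> int:
    """Return a scheduling priority (lower value = serviced sooner)."""
    if surviving_copies <= 1:
        return 0   # last copy: recovery must run ahead of everything
    if surviving_copies == 2:
        return 5   # at risk: elevate above routine application I/O
    return 9       # comfortably redundant: recovery can wait

# As failures accumulate, the recovery operation's priority rises:
assert recovery_priority(3) == 9
assert recovery_priority(2) == 5
assert recovery_priority(1) == 0
```

An analogous rule could track surviving symbols in a parity configuration, escalating as the data set nears the point where XOR'ing reconstruction becomes impossible.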
Thus, it would be useful to have a system that is able to manage storage unit bandwidth and, more specifically, to dynamically prioritize recovery of data sets within corresponding storage units.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.