Aspects of this invention are generally related to data storage, and more particularly to distributed data storage systems such as a redundant array of inexpensive disks (RAID). Enterprise storage platforms are relied upon to maintain data which may be critical to the operation of enterprises. The storage platforms typically include features such as RAID storage groups to help maintain and protect the data. Various levels of RAID storage are known in the art, but in general RAID systems are designed to enable recovery from data corruption and failure of a physical storage device such as a disk. A level 1 RAID system, for example, maintains copies of a set of data on two or more physical storage devices such as a mirrored pair of disks. Consequently, if one of the disks fails the data is still available from the mirror disk. While RAID 1 is highly reliable, it will be appreciated that it can require considerable storage capacity. Features such as parity data are used in some other RAID levels in order to achieve reduced storage capacity requirements. Features such as byte or block level striping are used in some RAID levels in order to achieve enhanced response time. RAID 5, for example, uses block level striping with parity data distributed across all devices. Generally, there are tradeoffs between reliability, efficiency, and response time.
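As a minimal illustration of how parity can protect a stripe with less capacity than mirroring, the following Python sketch computes a single XOR parity block for one stripe and recovers a lost block from the surviving blocks. The function name, the block contents, and the four-byte block size are illustrative assumptions. The sketch does not rotate the parity block across devices, as a RAID 5 layout would.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three data blocks plus one parity block (N=3, r=1).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate losing the device that holds data[1], then recover that block
# by XORing the remaining data blocks with the parity block.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

In this sketch three data blocks are protected by one additional block, whereas mirroring the same data would require three additional blocks.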
Reliability of the various RAID levels can be expressed in terms of the number of storage devices in the RAID. A RAID of N+r storage devices can sustain r failures. In other words, the RAID will fail on the (r+1)th failure, where a storage device that can no longer perform IOs is considered to be failed. Typical values are r=1 for RAID 1 and RAID 5, and r=2 for RAID 6. If the state of the RAID is defined by a tuple [Number drives up, Number drives failed], a failure can be considered as moving from the [(N+r), 0] state to the [(N−1+r), 1] state. Thus for r=1, the progression of states [N+1, 0]→[N, 1]→[N−1, 2] is the sequence from full redundancy to failure. The [N, 1] state is an exposed state. It is possible to rebuild from the [N, 1] state back to the fully redundant [N+1, 0] state, i.e. [N, 1]→[N+1, 0]. For example, parity incorporated into the RAID data may be used to perform data reconstruction after the failed storage device is replaced with a new functional storage device. Thus the [N+r−1, 1] state can be rebuilt to the fully redundant [N+r, 0] state using parity. However, rebuilding from the [N, 1] state back to the fully redundant [N+1, 0] state can take considerable time, and during that time a second storage device failure can result in data loss. Consequently, a race condition exists between rebuild and a subsequent storage device failure. The generally accepted Patterson probability model for RAID failures follows the sequence of states described above from the fully protected state to the failed state. The model assumes that failures are random and follow a constant failure rate. The model also assumes that rebuild times are much smaller than the mean time to failure of each storage device. The probability model of data loss is the product of the probabilities of moving along the sequence of states.
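The product-of-probabilities model described above can be sketched numerically for the r=1 case. The Python example below is a sketch under stated assumptions: each transition probability is linearized as rate multiplied by time, which is a common approximation when failures are independent, the failure rate is constant, and the probabilities are small. The function name, parameter names, and numeric values are hypothetical and are not taken from the text.

```python
def data_loss_probability(n, failure_rate, rebuild_hours, mission_hours):
    """Approximate probability of data loss for an r=1 group of n+1 devices.

    Multiplies the probabilities of the two transitions
    [N+1, 0] -> [N, 1] -> [N-1, 2].
    Assumption: each probability is approximated as rate * time.
    """
    # Any one of the N+1 devices fails during the mission time.
    p_first = (n + 1) * failure_rate * mission_hours
    # Any one of the N survivors fails before the rebuild completes.
    p_second = n * failure_rate * rebuild_hours
    return p_first * p_second

# Hypothetical values: 4+1 devices, 1e-6 failures per device-hour,
# a 10 hour rebuild, and a one year (8760 hour) mission time.
p_loss = data_loss_probability(n=4, failure_rate=1e-6,
                               rebuild_hours=10, mission_hours=8760)
```

Under this approximation the result scales linearly with rebuild time, which reflects the race between rebuild and a subsequent storage device failure.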
One problem with the generally accepted model is that it fails to account for undetected and uncorrected faults. Such faults include recoverable internal defects such as bad blocks which occur prior to the rebuild process. A wide variety of fault modes can create bad blocks. Undetected and uncorrected faults are problematic because they can cause data unavailability and data loss in the RAID group. Data unavailability refers to the inability of the storage system to service host IO requests within an acceptable interval because data cannot be written or accessed until the drive set and its data are restored from a source that may be external to the storage system. Data loss refers to the inability to service host IO requests due to either the inability to restore data without an unacceptably long outage or the inability to restore the data from any source, i.e., irrevocable data loss. An example is a case in which data required to rebuild a failed storage device is associated with an undetected or uncorrected fault on another storage device, e.g., parity data on a bad block on another drive in the RAID group.
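The example of parity data residing on a bad block can be illustrated with a short Python sketch. It assumes a single-parity XOR stripe and represents an unreadable block as `None`. The function name, the block contents, and the use of `RuntimeError` are illustrative assumptions rather than a description of any particular storage system.

```python
from functools import reduce

def rebuild_block(survivor_blocks):
    """Rebuild a lost block by XOR. A value of None marks an unreadable block."""
    if any(block is None for block in survivor_blocks):
        raise RuntimeError("rebuild impossible: a surviving block is unreadable")
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivor_blocks))

# Hypothetical two-data-block stripe with one XOR parity block.
d0, d1 = b"\x0f\xf0", b"\x33\xcc"
parity = bytes(a ^ b for a, b in zip(d0, d1))

# Healthy case: the device holding d1 fails, and d0 and parity are readable.
rebuilt = rebuild_block([d0, parity])

# Latent-fault case: the parity block sits on an undetected bad block,
# so the same single device failure can no longer be rebuilt.
try:
    rebuild_block([d0, None])
    rebuild_failed = False
except RuntimeError:
    rebuild_failed = True
```

In the second case only one storage device has failed, yet the block cannot be reconstructed within the group, which is the outcome the generally accepted model does not account for.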
Another problem is that drive health can be difficult to determine with generally available metrics. Bit error rate (BER) metrics, for example, only relate to bit errors in the head channel. However, media errors outweigh head channel faults by multiple orders of magnitude. Consequently, BER is a weak predictor of future data integrity, drive failure, data unavailability, and data loss. Further, error counts based on IOs, such as those taken over a SCSI interface, take long periods of time to resolve and consequently can leave drives with latent errors over undesirably lengthy time intervals.