Computer systems generally employ disk drive devices for storage and retrieval of large amounts of data. Disk drives may degrade, and their failure in large storage systems may cause serious problems. Such failures are usually attributed to defects in the recording media, failure of the mechanical components of the disk drive, failure of electrical components such as motors and servos, and failure of the electronic devices which are a part of the disk drive units, as well as a number of other attributable causes.
During normal operation, disk drives may exhibit a number of failure modes which have been identified by the disk drive industry. Some failure modes initially present themselves as an inability to read and/or write data. These are reported to a user or host computer as error codes after a failed command. Some of the errors are the result of medium errors on magnetic disk platters whose surface can no longer retain its magnetic state.
Disk drives (disk storage devices) may be temporarily “failed”, e.g., switched off-line, for several reasons, including error recovery such as a reset or a power cycle. A disk storage device may also be failed due to a failure in the communication path, such as a cable or a small form-factor pluggable (SFP) optical transceiver, or due to an enclosure issue, etc.
The most common type of drive array is the RAID (Redundant Array of Independent Drives). RAIDs use several inexpensive disk drives, with a total cost less than the price of a single high performance drive, to obtain similar performance with greater security. RAIDs use a combination of mirroring and/or striping to provide greater protection against data loss. For example, in some modifications of the RAID system, data is interleaved in stripe units distributed, together with parity information, across all of the disk drives.
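The striping-with-parity idea described above can be illustrated with a minimal sketch. It assumes a single XOR parity unit per stripe and, for simplicity, places the parity in the last position rather than rotating it across the drives; the function names `xor_blocks`, `make_stripe`, and `reconstruct` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_units):
    """Return the data units followed by one XOR parity unit.

    Assumption: parity is kept in the last slot; a real array would
    typically rotate the parity position from stripe to stripe.
    """
    return list(data_units) + [xor_blocks(data_units)]


def reconstruct(stripe, lost_index):
    """Rebuild the unit at lost_index by XOR-ing all surviving units."""
    survivors = [unit for i, unit in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
recovered = reconstruct(stripe, 1)  # the drive holding unit 1 is treated as failed
```

Because XOR is its own inverse, any one missing unit, data or parity, equals the XOR of the remaining units in the stripe. A single parity unit cannot recover from two simultaneous losses.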
Current RAID systems provide reliable data storage by constructing drive groups with added data redundancy based upon the RAID level used. For example, a RAID-6 system uses a redundancy scheme that may recover from a failure of any two disk drives. The parity scheme in such a RAID utilizes either a two-dimensional XOR algorithm or a Reed-Solomon code in a P+Q redundancy scheme.
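As a rough illustration of how a P+Q scheme can recover two lost data units, the sketch below computes P as a plain XOR and Q as a weighted sum over the finite field GF(2^8). The field polynomial 0x11d, the generator 2, and all function names are assumptions made for this example (they are a common choice, not something specified in the text above), and only the case of two lost data units is handled.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """Multiplicative inverse of a nonzero byte: a^254 in GF(2^8)."""
    return gf_pow(a, 254)


def pq_parity(data):
    """Return (P, Q) where P = XOR of D_i and Q = XOR of 2^i * D_i."""
    size = len(data[0])
    p, q = bytearray(size), bytearray(size)
    for i, unit in enumerate(data):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(unit):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Recover data units x and y (entries may be None) from P and Q."""
    size = len(p)
    # Treat the lost units as zeros; zeros contribute nothing to either parity.
    known = [bytes(size) if i in (x, y) else u for i, u in enumerate(data)]
    pxy, qxy = pq_parity(known)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(size), bytearray(size)
    for j in range(size):
        a = p[j] ^ pxy[j]  # equals D_x ^ D_y
        b = q[j] ^ qxy[j]  # equals 2^x * D_x ^ 2^y * D_y
        dx[j] = gf_mul(b ^ gf_mul(gy, a), denom_inv)
        dy[j] = a ^ dx[j]
    return bytes(dx), bytes(dy)


data = [b"disk0", b"disk1", b"disk2", b"disk3"]
p, q = pq_parity(data)
damaged = [data[0], None, data[2], None]
restored = recover_two(damaged, 1, 3, p, q)
```

The two parities give two independent linear equations per byte position, which is why any two unknown units can be solved for. Production implementations usually replace the bitwise multiply with lookup tables for speed.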
In all RAID systems, the disk drives are deemed either operational or failed by the control system. The failed disk drives are typically flagged for physical replacement. It may happen, however, that disk drives flagged for replacement are in fact repairable.
Modern disk drives are provided with built-in recovery mechanisms which require a rather lengthy system operational time and may need interaction with a disk drive controller. Normally, a disk storage system implementing a RAID algorithm rebuilds all of the data on a failed disk storage device. This operation may require, on average, several hours of operational time to reconstruct a single disk and, possibly, several days to reconstruct all disks on a failed channel. During this period of time, the data storage system may be susceptible to data loss if the remaining disks in the RAID groups become inoperable.
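A back-of-the-envelope sketch shows why a full rebuild is measured in hours. The capacity and sustained rebuild rate used here are purely hypothetical figures picked for illustration, and the helper name `rebuild_hours` is likewise an assumption; actual rates depend on the drive, the controller, and the competing host load.

```python
def rebuild_hours(capacity_bytes, rebuild_rate_bytes_per_s):
    """Estimate the time, in hours, to rewrite an entire disk at a steady rate."""
    return capacity_bytes / rebuild_rate_bytes_per_s / 3600


# Hypothetical figures: a 4 TB drive rebuilt at a sustained 100 MB/s.
hours_per_disk = rebuild_hours(4e12, 100e6)
```

Under these assumed figures a single disk takes roughly 11 hours, so rebuilding many disks one after another on a failed channel could plausibly stretch into days.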
Therefore, there is a need in the industry to avoid unnecessary physical replacement of disks exhibiting anomalous behavior by rebuilding temporarily failed (off-line) disks in the most efficient manner, so as to limit the amount of time needed for disk repair.