Computer systems generally employ disk drive devices for storage and retrieval of large amounts of data. Disk drives may degrade, and their failure in large storage systems causes serious problems. Such failures are usually attributed to defects in the recording media, failure in the mechanics of the disk drive mechanisms, failure in electrical components such as motors and servos, and failure in the electronic devices which are a part of the disk drive units, as well as a number of other attributable causes.
During normal operation, disk drives may exhibit a number of failure modes which have been identified by the disk drive industry. Some failure modes initially present themselves as an inability to read and/or write data. These are reported to a user or host computer as error codes after a failed command. Some of the errors are the result of medium errors on magnetic disk platters, the surface of which can no longer retain its magnetic state.
As the density of data per square inch of information carriers such as disks has increased greatly over the years, the susceptibility to errors caused by physical defects has become a greater problem to manufacturers. To combat these media issues, various predictive failure methods have been developed that identify potential failures and aggressively remove suspect areas of the magnetic media from use before the disk drive is released. There are, for example, algorithms that predict media failures due to surface scratches. These algorithms are usable at the time of fabrication, but are likely to fail within the usable life of the disk drive. There are also algorithms in the drive software that create lists (known as the G-list or grown defect list) of new defects that are detected during the operational life of the disk drive.
However, a particular defect may not be identified in a timely manner, and there may be a significant delay before the defect is added to the defect list. For example, a drive may have a limit of 50 failed attempts to read a particular area in response to a single command from the host CPU before the media error is considered significant enough to be “mapped out” of the usable space on the drive. Therefore, one physical media area may be encountered a number of times and would still not trigger the G-list mechanism.
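The effect of a per-command retry limit can be illustrated with a minimal sketch. The limit of 50 is taken from the example above; the function name, the per-command accounting, and the retry counts are hypothetical and are not drawn from any particular drive's firmware.

```python
RETRY_LIMIT = 50  # per-command limit taken from the example above


def is_mapped_out(failed_attempts_per_command, limit=RETRY_LIMIT):
    """Return True if any single command reaches the retry limit.

    Hypothetical model: failed attempts are counted per command and
    are not accumulated across commands.
    """
    return any(n >= limit for n in failed_attempts_per_command)


# Ten separate commands hit the same weak area, 30 failed attempts each.
history = [30] * 10
total_failed = sum(history)         # failed attempts summed over all commands
triggered = is_mapped_out(history)  # no single command reached the limit
```

Under this assumed model the area accumulates many failed reads overall, yet it is never added to the defect list because no individual command reaches the limit.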
The industry has adopted error correction and detection algorithms in software and hardware that automatically correct errors in the data that are read from the media. The usual measure of reliability in a communication system, such as a “bit error rate”, becomes obscured when the errors are automatically corrected. As the process continues to evolve, one cannot rely on the internal mechanisms of the disk drive to identify potential data errors in a way that is timely enough to maintain a high-throughput, high-reliability system. By the time a single drive media error is corrected internally by the disk drive, the performance across the entire storage system may have already suffered significantly.
Early drive replacement rates in large scale storage systems are typically 2-4%, with rates possibly exceeding 10%. If a single drive with otherwise undetected media errors causes a performance degradation, then storage systems that use multiple drives for logical units, such as RAID systems, may be greatly impacted. The potential exists for the slowest component to dictate the maximum throughput of the system, which is unacceptable in industry.
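How one slow drive can set the pace for a logical unit may be sketched as follows. This is a simplified model which assumes that a striped read completes only when every member drive has responded; the function name and the latency figures are illustrative assumptions, not measurements.

```python
def stripe_read_time_ms(drive_latencies_ms):
    """A striped read finishes only when the slowest drive has responded."""
    return max(drive_latencies_ms)


healthy = [8.0] * 10             # ten drives, all responding normally
degraded = [8.0] * 9 + [400.0]   # one drive retrying internally (illustrative)

healthy_time = stripe_read_time_ms(healthy)
degraded_time = stripe_read_time_ms(degraded)
slowdown = degraded_time / healthy_time  # factor by which the whole read slows
```

In this model nine healthy drives cannot compensate for the tenth: the read time for the whole stripe equals the latency of the single degraded drive.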
The most common type of drive array is the RAID (Redundant Array of Inexpensive (Independent) Disks). The RAID uses several inexpensive drives, with a total cost less than the price of a high performance drive, to obtain similar performance with greater security. RAIDs use a combination of mirroring and/or striping to provide greater protection from lost data. For example, in some modifications of the RAID system, data is interleaved in stripe units distributed with parity information across all of the disk drives. A RAID-6 system uses a redundancy scheme that can recover from a failure of any two disk drives. The parity scheme in the RAID utilizes either a two dimensional XOR algorithm or a Reed-Solomon code in a P+Q redundancy scheme.
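The role of parity in a stripe can be illustrated with a minimal sketch. It shows only simple single-parity XOR reconstruction of one lost block, not the two dimensional XOR or Reed-Solomon P+Q scheme of RAID-6; the helper name and the sample data blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe, each stored on a different drive.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)  # parity block stored on a fourth drive

# The drive holding data[1] is lost; rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

The reconstruction step must be told which block is missing. The XOR of all blocks in a stripe shows that something is inconsistent, but not which drive is responsible.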
Even with the RAID architecture, for example RAID-6, such systems, while having the ability to detect failures in up to two disk drives, still need a mechanism for identifying a disk and/or a disk storage channel in error. Without the ability to identify the problematic storage disk, the more fault tolerant parity algorithm of the RAID-6 system is unable to provide satisfactory, problem-free performance. It is important to detect problematic disks while they are still “healthy” so that they can be scheduled for replacement, to ensure that the stored data is not lost and that the overall performance of the storage system is not undermined.
The disk drive industry currently uses Self-Monitoring, Analysis and Reporting Technology (SMART) to determine when a drive is likely to fail. Several of the SMART parameters do correlate well with impending drive failure. However, this technology often misses drives that require replacement. Most drives that fail in large systems are not detected by SMART, since they report no SMART errors.
Therefore, there is a need in the industry for a failure prevention tool for detecting problematic disks in RAID storage systems that is more comprehensive and defect sensitive than the current technology.