Array controllers employ mechanisms for recovering from drive media exceptions by utilizing the data redundancy inherent in most types of redundant array of independent disks (RAID) storage configurations. However, an array controller's ability to recover from such drive exceptions may allow drive reliability problems to develop on one or more drives in a RAID group over an extended period of time. Also, the potential for data corruption may increase as drive reliability problems develop on any of the drives in a RAID group. As such, by the time a first drive is failed by the array controller, the remaining drives in the RAID group may also have developed reliability problems from which the array controller can no longer recover following a loss of redundant data. Such a scenario often results in the loss of data availability because a second drive fails during the rebuild process for the first failed drive.
The potential for this loss of data availability may be even greater when larger or less expensive drives are used in a RAID group. Even when a sufficient number of reliable drives remain to rebuild data on a failed drive or drives, the rebuild process may be time consuming, requiring special hardware and complex reconstruction software procedures. The time it takes to completely rebuild the data from a first drive on a replacement drive increases the potential for a subsequent failure of a second drive that will result in a loss of data availability. There may also be a potential for the replacement drive for the first drive to fail during the rebuild process, further threatening data availability by expanding the window of opportunity for a second drive failure.
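The relationship between rebuild time and exposure to a second failure can be illustrated with a rough calculation. The following is a minimal sketch, not a model taken from this description: it assumes independent drives with exponentially distributed lifetimes, and the function names, capacities, rebuild rates, and mean-time-to-failure figures are all hypothetical.

```python
import math


def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite a whole drive at a sustained rebuild rate.

    Assumes 1 GB = 1000 MB and a constant rate (both simplifications).
    """
    return (capacity_gb * 1000.0) / rebuild_mb_per_s / 3600.0


def p_second_failure(remaining_drives: int, window_hours: float,
                     mttf_hours: float) -> float:
    """Probability that at least one remaining drive fails within the window.

    Assumption: independent drives with exponential lifetimes and the same
    mean time to failure. Degraded drives would fail sooner than this.
    """
    return 1.0 - math.exp(-remaining_drives * window_hours / mttf_hours)


if __name__ == "__main__":
    # Hypothetical comparison: a small drive versus a drive eight times larger.
    for capacity_gb in (500, 4000):
        hours = rebuild_hours(capacity_gb, rebuild_mb_per_s=50)
        risk = p_second_failure(remaining_drives=7, window_hours=hours,
                                mttf_hours=500_000)
        print(f"{capacity_gb} GB: rebuild {hours:.1f} h, risk {risk:.6f}")
```

Under these assumptions the risk grows with both the rebuild window and the number of remaining drives, which is the effect described above for larger drives and larger RAID groups.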
One method for handling drive degradation may be for the array controller to wait until a drive's Self-Monitoring, Analysis and Reporting Technology (SMART) feature detects an unreliable drive, or to wait until the drive is completely unable (e.g. having exhausted the array controller's retry and recovery procedures) to complete a requested operation. SMART is an internal drive technology, used by most modern drives, that monitors drive operating metrics and exceptions in order to predict when a drive may be unreliable. Some drive types actively report SMART errors, while other drive types require an external process to poll for drive conditions.
However, SMART may be inadequate at detecting drives with developing reliability problems. For example, thresholds may not be based on drive rates, drives may not count all exceptions (e.g. those reported back to the device that initiated the command), drive types may not report errors from which the drive was able to recover and, for some drive types, degraded conditions must be polled for by an external process. An array controller may fail a drive because a required IO command could not be completed or because the drive is exhibiting degraded performance, yet at no time does the drive report a SMART error.
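The first limitation, thresholds that are not rate based, can be illustrated by comparing two checks on the same exception history. This is only a sketch: it reads "rates" as exceptions per unit of IO, which is an interpretation, and the function names, threshold values, and sample counts are hypothetical and are not drawn from any drive's actual SMART implementation.

```python
def count_threshold_trips(errors_per_interval, lifetime_threshold):
    """Cumulative check: trips only once lifetime exceptions reach a limit."""
    return sum(errors_per_interval) >= lifetime_threshold


def rate_threshold_trips(errors_per_interval, ios_per_interval,
                         max_errors_per_million_ios):
    """Rate check: looks only at the most recent interval.

    Assumes every interval carries the same number of IOs (hypothetical).
    """
    recent = errors_per_interval[-1]
    rate = recent * 1_000_000 / ios_per_interval
    return rate > max_errors_per_million_ios


if __name__ == "__main__":
    # Hypothetical history: exceptions are accelerating, but the total is small.
    history = [0, 0, 1, 4, 9]
    print(count_threshold_trips(history, lifetime_threshold=50))
    print(rate_threshold_trips(history, ios_per_interval=100_000,
                               max_errors_per_million_ios=50))
```

With these made-up numbers the cumulative check stays quiet while the rate check flags the drive, which is one way a worsening drive can go unreported.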
Another mechanism for handling decaying drive quality may be for array controllers to perform a background drive media scan that corrects drive media errors as they are encountered. However, the media scan may be incapable of keeping up with the creation of new media errors.
Another option may be the use of one of the servers or a separate service processor to periodically scan the controller and/or drive error logs of an array to detect developing drive reliability problems. Such a method requires issuing in-band or out-of-band commands to all the array controllers and to every drive in the storage system. However, the external analysis of array and drive exception logs cannot provide rapid detection of an unreliable drive because of the inherent delay of the polling cycle and the overhead of the error log analysis. The technique can also impact performance and increase total storage cost, especially if the function is performed by a separate service processor.
Another option may be the use of a RAID configuration providing additional data redundancy, such as RAID-6, so the RAID group can withstand more than one drive failure without the loss of data availability. Using fewer drives in each RAID group reduces the potential for multiple drive failures and subsequent loss of data availability. Drive reliability can be enhanced by limiting the drive input/output (IO) workload generated by an array controller and by utilizing a large array controller data cache in order to reduce the IO workload on each individual drive in the storage system. However, use of a RAID configuration with additional data redundancy, such as RAID-6, impacts performance and requires additional drives, which increases total storage cost. Similarly, restricting the number of drives in a RAID group as a means to improve reliability increases total storage cost, while storage market requirements push for larger numbers of drives in a RAID group in order to reduce cost.
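The cost effect of added redundancy and of smaller RAID groups can be made concrete by computing the share of raw capacity left for data. The sketch below relies only on the standard parity overhead of one drive per group for RAID-5 and two for RAID-6; the function name and the group sizes are hypothetical examples.

```python
# Drives' worth of capacity consumed by redundancy in each RAID group.
PARITY_DRIVES = {"RAID-5": 1, "RAID-6": 2}


def usable_fraction(level: str, drives_in_group: int) -> float:
    """Fraction of raw group capacity available for data."""
    parity = PARITY_DRIVES[level]
    if drives_in_group <= parity:
        raise ValueError("group too small for this RAID level")
    return (drives_in_group - parity) / drives_in_group


if __name__ == "__main__":
    for level in ("RAID-5", "RAID-6"):
        for size in (4, 8, 16):
            print(f"{level}, {size} drives: "
                  f"{usable_fraction(level, size):.1%} usable")
```

For a given group size RAID-6 leaves less usable capacity than RAID-5, and shrinking the group lowers the usable share further, so both reliability measures raise the number of drives needed per unit of stored data.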
Other alternatives that limit the drive IO workload or use more reliable drives increase total storage cost. Some ways to reduce the drive IO workload may be using a larger array controller cache or artificially limiting the array performance. However, these mechanisms may increase storage cost or increase the time to rebuild a failed drive on the replacement drive.