1. Field of the Invention
The present invention relates to technology for controlling data storage systems, and more particularly to early detection of storage device degradation.
2. Background Information
Data storage subsystems having plural devices, such as Redundant Array of Inexpensive Disks (RAID) arrays, store customer data on multiple storage devices to avoid data loss if one or more of the data storage devices fail. The present invention relates to such plural-device storage subsystems, which may comprise disk storage, such as RAID arrays. Other storage devices, including, but not limited to, tape and flash memory devices, may also be included in such a subsystem.
To minimize the time the system operates without redundancy, some storage devices may be configured as “hot spare” devices. Under normal operating circumstances, these hot spares do not hold customer data. As soon as an array member fails in operation, however, the subsystem chooses a hot spare to replace the failed array member and then rebuilds the array using the hot spare. While the array is rebuilding, it operates with reduced redundancy, so there is pressure to complete the rebuild as quickly as possible.
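The hot spare replacement described above can be illustrated with a minimal sketch. It is not taken from any particular subsystem: the names `Role`, `Device`, `Array` and `on_member_failure` are hypothetical, and choosing the first available spare is an assumption made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MEMBER = "member"
    HOT_SPARE = "hot spare"
    FAILED = "failed"


@dataclass
class Device:
    name: str
    role: Role


@dataclass
class Array:
    devices: list             # every device known to the subsystem (hypothetical model)
    rebuilding: bool = False  # True while the array runs with reduced redundancy

    def members(self):
        """Return the devices currently acting as array members."""
        return [d for d in self.devices if d.role is Role.MEMBER]

    def on_member_failure(self, failed):
        """Mark the member as failed and promote the first available hot spare.

        Returns the promoted spare, or None when no spare is left, in which
        case a device must be physically replaced.
        """
        failed.role = Role.FAILED
        spare = next((d for d in self.devices if d.role is Role.HOT_SPARE), None)
        if spare is None:
            return None
        spare.role = Role.MEMBER
        self.rebuilding = True  # the rebuild onto the spare starts here
        return spare


# Usage: one spare covers the first failure; the second failure finds none.
devices = [Device("d0", Role.MEMBER), Device("d1", Role.MEMBER),
           Device("s0", Role.HOT_SPARE)]
array = Array(devices)
promoted = array.on_member_failure(devices[0])
```

The sketch promotes the spare without checking its condition, which is how a spare that degraded while idle comes to fail only once the rebuild directs substantial I/O to it.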
A further possible use of a storage device, apart from as a member, a spare, or a failed device, is as an array candidate or free device. These are devices that are ready to be used either by being configured as array members in the creation of a RAID array or by being configured as hot spares. In the meantime, they remain idle, and any degradation they might have suffered remains invisible to the subsystem control components.
In spite of the built-in redundancy and the use of hot spares, there remains the problem that a storage device can fail while it is idle, in which case the subsystem does not detect the failure until it attempts substantial I/O to the device. This is usually an inconvenient time to detect the failure, namely just when the device is needed to accept data I/O activity.
Hot spare devices sit idle while waiting for array members to fail. Their condition can degrade during this time such that they fail when the subsystem attempts to do substantial I/O to them during the array rebuild. This is disadvantageous because the subsystem must operate longer with reduced redundancy (and, in the worst case, if there are no more spares, manual intervention by way of physically replacing a device may be required), thus increasing the chance of data loss. As has been pointed out above, the same problem exists for free devices: their condition may degrade while they stand idle and the system does not detect this until they are turned into array members and subjected to an I/O workload.
Indeed, in some cases this same problem may exist for devices that are configured as active array members. Lightly loaded or periodically idle arrays are not unusual. When an array member fails, a rebuild begins. The rebuild operation is typically designed to run as quickly as possible, which puts stress on all the other array members, typically including some active array members that have not been used for I/O for some time. Another disk failure at this point causes an array loss for singly-redundant RAID systems (e.g. RAID 5), and in other subsystems at least slows down the process by requiring repeated I/Os.
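Why a second failure during rebuild is fatal for a singly-redundant array can be seen from a small sketch of XOR parity, the scheme commonly used by RAID 5. The helper names `xor_blocks` and `rebuild_missing` are hypothetical, and the stripe layout is simplified to a list of equal-length blocks, with `None` standing for a block on a failed device.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Reconstruct the single missing block (None) of a stripe.

    The stripe holds the data blocks plus one parity block. XOR of all
    surviving blocks yields the missing one, but only if exactly one
    block is missing.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("expected exactly one missing block, found %d" % len(missing))
    survivors = [block for block in stripe if block is not None]
    return missing[0], xor_blocks(survivors)


# Usage: three data blocks and their parity; the device holding block 1 fails.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
stripe = data + [xor_blocks(data)]
stripe[1] = None
index, recovered = rebuild_missing(stripe)
```

With one block missing, the rebuild must read every surviving member, which is the stress referred to above. With two blocks missing, the XOR no longer determines either of them, and `rebuild_missing` raises an error.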
Disk drives implement self-test procedures of varying degrees of complexity. In some enterprise-class disk drives, these include a background media scan which checks that blocks remain readable. These features may improve matters, but they do not attempt to simulate a “real-life” customer workload. This solution is also specific to the disk drive vendor and model and, in particular, lower-cost (e.g. SATA) disk drives are less likely to implement these features.
RAID controllers implement array and disk scrubbing. These processes check every few days that all blocks of array members, hot spares and free disks are readable. Again, these features do not attempt to simulate a customer workload. Scrubbing is also potentially disruptive to normal processing because of the amount of I/O time and device resource that is diverted to it.
Various advanced RAID techniques exist that use distributed sparing. In this scheme there are no dedicated hot spare devices, but instead spare capacity is distributed around each array. This eliminates the hot spare problem completely. However, it is very hard to retrofit such a scheme to an existing RAID architecture with dedicated hot spare devices.
In the light of the disadvantages of the above-mentioned techniques, it would be highly desirable to have a technological means for the early detection of disk degradation, without the cost and inconvenience incurred by complex and cumbersome measures.