Data integrity is among the highest concerns for many data storage subsystems. These subsystems use various RAID techniques to provide data redundancy, so that when data is not available from one of the disk drives, it can be recovered from the other drives in the RAID subsystem. Most subsystems provide only single-fault tolerance, because the cost of multi-fault tolerance is high. Unfortunately, unrecoverable read errors, also known as hard errors, occur in disk drives. An unrecoverable read error is the failure that arises when a sector cannot be read even after all the steps in a drive's error recovery procedure (ERP) have been exhausted. Such steps include invoking all levels of error correction code (ECC) that have been implemented and retrying with different head offsets. When an unrecoverable read error is encountered while all the other drives in the RAID are still accessible, that sector can be reconstructed from those drives. However, if the hard error is encountered when one of the drives in the RAID has already failed, then neither the hard error sector nor its corresponding sector in the failed drive is recoverable. This becomes a data loss situation.
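The single-fault limit described above can be illustrated with a minimal sketch. It assumes a single-parity (XOR) layout, which is only one of the RAID techniques a subsystem might use; the names `xor_blocks` and `reconstruct` are hypothetical and chosen for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe):
    """Rebuild the single unreadable block of a stripe.

    `stripe` holds the data blocks plus one parity block (assumed XOR
    parity). An unreadable block is represented as None.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        # Zero missing blocks: nothing to do. Two or more: with single
        # parity the data cannot be recovered (the data loss situation).
        raise ValueError("exactly one missing block is required")
    return xor_blocks([block for block in stripe if block is not None])


# Two data blocks and their parity block.
data = [b"\x01\x02", b"\x10\x20"]
parity = xor_blocks(data)

# A hard error on the second drive alone is recoverable.
recovered = reconstruct([data[0], None, parity])
```

If a second block of the same stripe is also unavailable, for example because another drive has already failed, `reconstruct` raises instead of returning data, which corresponds to the data loss case.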
To reduce the probability of encountering a hard error after a drive has failed, some storage subsystems implement a data scrubbing routine to flush out such errors before the data is actually needed. In such a scheme, the RAID controller itself periodically issues verify read commands (i.e., commands that read from the disk media rather than from cache) to each drive controller and cycles through every address in each drive. When an unrecoverable error is encountered and reported by one of the drive controllers, the RAID controller reconstructs that data from the other drives using RAID redundancy. Some subsystems have auto-reassign enabled in the disk drives so that a sector with a hard error is automatically reassigned to a spare location. Others have auto-reassign turned off so that the RAID controller can decide whether to first try rewriting the data to the original location to see if the problem clears, reassigning only if the problem persists (e.g., a scratch on the media). In all cases, the RAID controller logs the statistics of unrecoverable errors. When such errors exceed a predetermined threshold, the drive is scheduled for replacement.
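The controller-managed scrub cycle described above can be sketched as follows. This is an illustrative simulation only: it assumes XOR parity, auto-reassign turned off, and an in-memory stand-in for a drive. The class `SimulatedDrive`, the function `scrub`, and all parameter names are hypothetical and do not correspond to any real controller interface.

```python
from functools import reduce
from operator import xor


class UnrecoverableReadError(Exception):
    """Raised when the drive's error recovery procedure is exhausted."""


class SimulatedDrive:
    """In-memory stand-in for a disk drive (hypothetical, for illustration)."""

    def __init__(self, sectors, bad=(), scratched=()):
        self.sectors = list(sectors)
        self.bad = set(bad)              # addresses that fail a verify read
        self.scratched = set(scratched)  # addresses a rewrite cannot repair
        self.reassigned = set()          # addresses moved to a spare location

    def verify_read(self, lba):
        if lba in self.bad:
            raise UnrecoverableReadError(lba)
        return self.sectors[lba]

    def write(self, lba, data):
        self.sectors[lba] = data
        if lba not in self.scratched:
            self.bad.discard(lba)        # the rewrite cleared the problem

    def reassign(self, lba, data):
        self.reassigned.add(lba)
        self.bad.discard(lba)
        self.sectors[lba] = data


def scrub(drives, error_counts, threshold):
    """Run one scrub cycle; return the indices of drives to replace."""
    to_replace = set()
    for lba in range(len(drives[0].sectors)):
        for d, drive in enumerate(drives):
            try:
                drive.verify_read(lba)
                continue                 # sector is readable
            except UnrecoverableReadError:
                error_counts[d] += 1     # log the error statistics
            # Reconstruct from the other drives. A second hard error at the
            # same address propagates as an exception (data loss).
            others = [o.verify_read(lba) for i, o in enumerate(drives) if i != d]
            data = bytes(reduce(xor, column) for column in zip(*others))
            drive.write(lba, data)       # first try the original location
            try:
                drive.verify_read(lba)
            except UnrecoverableReadError:
                drive.reassign(lba, data)  # problem persists: reassign
            if error_counts[d] > threshold:
                to_replace.add(d)
    return to_replace


# Three drives, two sectors each; the third drive holds XOR parity.
d0 = SimulatedDrive([b"\x01\x02", b"??"], bad={1})
d1 = SimulatedDrive([b"\x10\x20", b"\x30\x40"])
d2 = SimulatedDrive([b"??", b"\x33\x44"], bad={0}, scratched={0})
counts = [0, 0, 0]
replace = scrub([d0, d1, d2], counts, threshold=0)
```

In this sketch every readable sector is still transferred to the controller on each pass, because the controller drives the whole cycle itself.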
The subsystem controller-managed data scrubbing described above has two disadvantages. First, the resources of the RAID system controller are required to manage and execute the scrubbing process. Second, data is returned from the drives even though the controller does not actually need it. This generates unnecessary traffic on the bus and can potentially degrade system performance. Because of these limitations, the data scrub cycle may take longer than is desirable.
The present invention recognizes that the above problems can be ameliorated by implementing data scrubbing at a disk drive controller level that does not require any action by the RAID controller unless an unrecoverable error is encountered.