1. Field of the Invention
The present invention relates to disk array storage systems, and particularly to RAID systems. More particularly, the invention concerns error recovery in disk array storage systems in response to disk error conditions.
2. Description of the Prior Art
Disk array storage systems store data across multiple disk storage devices (disk drives) arranged to form one or more disk drive array groups. A common type of disk array storage system is the RAID (Redundant Array of Independent Disks) system. In such apparatus, a host computer connects to a RAID controller that manages one or more RAID intelligent disk enclosures, each containing a control processor and an array of disk drives mounted in individual device slots.
A disk error condition arises when one of the RAID disks fails to respond to the RAID controller's attempts to implement some action, such as a disk selection operation, a command transfer operation, or a data/control transfer operation. Usually, the error is indicated as persisting despite command or phase retries. A disk error condition may likewise manifest itself as a failure to continue an operation that has started.
The conventional approach to correcting a disk error condition in a RAID system is to issue a device or bus reset command if no response is received to a disk command within a pre-defined time limit, or after multiple command retries have failed. Following the reset attempt, the failed command is reissued. If the command again fails, the reset recovery action may be retried one or more times up to some specified retry limit. At that point, if there is still no response, the RAID controller marks the drive as “dead,” which means the drive is placed in offline status with further accesses being prohibited.
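By way of illustration only, the conventional recovery sequence described above can be sketched as follows. This is a minimal sketch and not the firmware of any particular RAID controller. The names `conventional_recovery`, `issue_command`, `reset`, `command_retries` and `reset_retries` are hypothetical, as are the default retry counts. The `issue_command` callable is assumed to return `False` when no response arrives within the pre-defined time limit.

```python
from enum import Enum
from typing import Callable


class DriveState(Enum):
    ONLINE = "online"
    DEAD = "dead"  # offline, further accesses prohibited


def conventional_recovery(
    issue_command: Callable[[], bool],  # hypothetical: True if the drive responded in time
    reset: Callable[[], None],          # hypothetical: device or bus reset
    command_retries: int = 3,           # assumed value, not taken from the text
    reset_retries: int = 2,             # assumed value, not taken from the text
) -> DriveState:
    """Illustrative sketch of the conventional reset-and-retry recovery."""
    # Initial attempt followed by plain command retries.
    for _ in range(1 + command_retries):
        if issue_command():
            return DriveState.ONLINE

    # Reset, then reissue the failed command, up to the reset retry limit.
    for _ in range(reset_retries):
        reset()
        if issue_command():
            return DriveState.ONLINE

    # Still no response: the drive is marked dead. Note that no power
    # cycle of the drive is attempted anywhere in this sequence.
    return DriveState.DEAD
```

In this sketch, a drive whose error would clear only after a power cycle is marked dead even though it may be fully operable afterwards.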
If a disk drive marked dead is a member of a redundant RAID array, the higher level logical drive associated with the faulty physical drive is still accessible. However, array performance is degraded due to the need to reconstruct data for the faulty drive when requested by the host. If the faulty drive is part of a non-redundant RAID array, the associated logical drive must be marked offline, making data access impossible.
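The consequence for the logical drive can be summarized in a short sketch. The function name `logical_drive_status` and the status strings are hypothetical, and the sketch assumes at most one member drive has been marked dead, which is the only case discussed above.

```python
def logical_drive_status(redundant: bool, member_marked_dead: bool) -> str:
    """Illustrative only: status of a logical drive after at most one member failure."""
    if not member_marked_dead:
        return "online"
    if redundant:
        # Still accessible, but data for the faulty drive must be
        # reconstructed on each host request, degrading performance.
        return "degraded"
    # Non-redundant array: data access is impossible.
    return "offline"
```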
Disk drive manufacturers receiving drives for repair that have been marked dead in a RAID system often find them quite operable. This condition is known as NTF (No Trouble Found). The drive manufacturers often complain to their customers about the mis-killings by RAID controllers. The RAID controller manufacturers, on the other hand, may defend their actions by referring to certain posted user-accessible events that indicate reasons for killing the disk drives.
What the RAID controllers actually failed to do in these disputed cases is to try to power cycle the disk drive and then retry the failing operation before killing the drive. On the other hand, disk drive manufacturers invariably try to power up the returned dead disk drive before diagnosing the problem. Certain transient or non-recurring disk drive ASIC errors, including a possible microprocessor hang, can be cleared if the unit is first powered off and then powered back on, causing the drive firmware to be reloaded from the media and the hardware to be restarted from scratch.
In a conventional RAID system enclosure, power cycling a single faulty disk drive may mean powering down the entire multi-disk drive enclosure and powering it back on. Alternatively, the faulty drive may need to be manually removed from its device slot and then be re-inserted into the slot, thereby simulating a disk drive power cycle. Neither of these manual actions is practical. The former action is particularly undesirable insofar as it involves taking multiple disk drives down at the same time. The latter action requires operator retraining insofar as RAID system operators are trained only to remove disk drives for the purpose of substituting in a new disk drive.
It is possible to retrain RAID system operators to manually remove a faulty disk drive and then re-insert it into the device slot. This action would simulate a disk drive power cycle and presumably clear the disk error condition. Then, in the case of a redundant RAID array housed in an intelligent enclosure, an auto rebuild operation would take place if no standby rebuild has already occurred. Alternatively, the re-inserted disk drive would become a new standby drive for a future rebuild. In the case of a non-redundant array in a RAID enclosure, a user-initiated data restoration would have to take place before the associated logical drive can be placed back in operation.
Although the foregoing manual action may correct the disk error condition, there are a number of associated problems. First, human operator intervention is required. Second, it is not practical to assume that the human operator is located where the problem is at the time a disk error condition arises. Third, and perhaps most importantly, the faulty disk drive is first marked “dead” before the operator intervention occurs. This means in a redundant RAID system that a data regeneration operation has already taken place, and that a physical rebuild operation must be implemented following the human intervention. In a non-redundant RAID system, logical drive operation will have already been terminated and a user-required data restore may have taken place.
Accordingly, a need exists for a disk array storage system in which disk error conditions can be resolved in a manner that avoids the foregoing disadvantages. What is particularly required is a disk error recovery procedure wherein a faulty disk drive can be tested for transient or non-recurring software, firmware or hardware errors that normal resets will not resolve. Preferably, such error recovery will be performed prior to the faulty disk drive being marked as dead and taken offline.