1. Technical Field
The present invention relates generally to recovery from unrecoverable read errors on computer hard drives in RAID (Redundant Array of Independent Disks) configurations in which the RAID functionality is provided by the system processor. More specifically, this invention relates to the reporting of, and recovery from, such errors using the Small Computer System Interface (SCSI) protocol.
2. Description of Related Art
Computer systems are often arranged with redundant data storage in order to permit recovery of lost data, for example, from damaged media. Currently, RAID controllers initiate background read operations on the hard drives attached to them in order to find locations on the media that may have been damaged, causing either hard data errors or recoverable data errors that require significant levels of error recovery. This functionality is called data scrubbing. If a hard error is encountered during scrubbing, the bad Logical Block Address (LBA) is reassigned, and when the drive is a member of a RAID configuration (other than RAID 0), any lost data can be recreated and rewritten. Thus, RAID data redundancy is maintained. This is usually accomplished transparently to the operating system (OS), application programs, and the user.
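By way of illustration only, the scrubbing behavior described above may be sketched as a toy model of a two-drive mirror. The names `Disk`, `MediaError`, and `scrub_mirror` are hypothetical and are introduced solely for this sketch; they do not correspond to any actual controller firmware or SCSI command, and reassignment is modeled simply as making the LBA readable again.

```python
class MediaError(Exception):
    """Raised when a block cannot be read (a hard, unrecoverable read error)."""


class Disk:
    """Toy model of a drive: a list of blocks plus a set of damaged LBAs."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.bad = set()

    def read(self, lba):
        if lba in self.bad:
            raise MediaError(lba)
        return self.blocks[lba]

    def reassign_and_write(self, lba, data):
        # Assumption: reassignment maps the LBA to a spare sector,
        # after which the recreated data can be written and read back.
        self.bad.discard(lba)
        self.blocks[lba] = data


def scrub_mirror(primary, mirror):
    """Read every LBA on both drives; repair a hard error from the other copy.

    If the same LBA is unreadable on both drives, the MediaError from the
    peer read propagates, since no redundant copy remains.
    """
    repaired = []
    for lba in range(len(primary.blocks)):
        for disk, peer in ((primary, mirror), (mirror, primary)):
            try:
                disk.read(lba)
            except MediaError:
                # Redundancy still exists, so the lost data can be recreated.
                disk.reassign_and_write(lba, peer.read(lba))
                repaired.append(lba)
    return repaired


a = Disk([b"A", b"B", b"C"])
b = Disk([b"A", b"B", b"C"])
a.bad.add(1)        # latent media damage on drive a
a.blocks[1] = None  # the contents of that block are lost
repaired = scrub_mirror(a, b)
```

In this sketch the damaged block is found and rewritten while the mirror copy is still available, which is the point of scrubbing in the background rather than waiting for a drive failure.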
When a hard drive is attached to a host via a “just a bunch of disks” (JBOD) host bus adapter, the adapter does not initiate this background data scrubbing activity. When JBOD drives are configured as RAID arrays in which the RAID functionality is provided by the host CPU and the OS, rather than by a RAID adapter, the background scrubbing functionality is usually not included. This is mainly because significant system resources would be consumed in performing background data scrubbing on all the hard drives attached to the system. Thus, in a system configuration where the OS provides RAID functionality (i.e., acts as the RAID controller), if a drive in the RAID array fails and a hard media error is then encountered during the rebuild process, the rebuild will fail because the array was already running exposed (i.e., with no redundancy). For example, this can occur when using the IBM Advanced Interactive eXecutive (AIX) operating system mirroring (RAID 1) that is used on an IBM eServer pSeries system. Further, some errors are not discovered during normal operation, because a hard error may occur in an LBA containing infrequently used data. In such cases, a maintenance window has to be scheduled reasonably quickly so that the system can be brought down and the RAID 1 array can be recreated from backup tapes. Such issues are unacceptable in systems requiring high reliability.
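The failure mode described above may likewise be sketched, as an illustration only, for a RAID 1 mirror that has lost one drive. The function `rebuild_mirror` and its parameters are hypothetical names chosen for this sketch; the set `unreadable_lbas` stands in for latent media errors that were never found because no scrubbing took place.

```python
def rebuild_mirror(survivor_blocks, unreadable_lbas):
    """Copy every block from the sole surviving drive to a replacement drive.

    Returns (replacement_blocks, failed_lbas). Because the array is running
    exposed, a block that cannot be read from the survivor has no other copy
    and cannot be recreated.
    """
    replacement = []
    failed = []
    for lba, data in enumerate(survivor_blocks):
        if lba in unreadable_lbas:
            # Hard media error with no redundancy left: the data is lost.
            replacement.append(None)
            failed.append(lba)
        else:
            replacement.append(data)
    return replacement, failed


# One drive of the mirror has already failed; the survivor carries an
# undiscovered hard error at LBA 2 in rarely accessed data.
replacement, failed = rebuild_mirror([b"A", b"B", b"C"], unreadable_lbas={2})
rebuild_ok = not failed
```

Here `rebuild_ok` is false, corresponding to the case in which the array must be recreated from backup.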