1. Technical Field
The present invention relates generally to the data processing field and, more particularly, to a method, system and computer program product for reporting and recovering from uncorrectable data errors in a data processing system having an operating system implemented RAID environment and using the Advanced Technology Attachment (ATA) or the Serial ATA (SATA) protocol.
2. Description of Related Art
Data processing systems are often arranged with redundant data storage in order to permit recovery of lost data, for example, from damaged media. Currently, RAID (Redundant Array of Independent Disks) controllers initiate background read operations on attached hard drives in order to find locations on the media that may have been damaged, causing either hard data errors or recoverable data errors that require significant levels of error recovery. This functionality is called data scrubbing. If a hard error is encountered during data scrubbing, the bad Logical Block Address (LBA) is reassigned and, when the drive is a member of a RAID configuration (other than RAID 0), any lost data can be recreated and rewritten. Thus, RAID data redundancy is maintained. This is usually accomplished transparently to the operating system (OS), application programs, and the user.
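The scrubbing behavior described above can be illustrated with a minimal sketch for a two-drive mirror (RAID 1). It is not taken from any actual controller firmware: the `SimDrive` class, the `MediaError` exception, and the `scrub_mirror` function are hypothetical names that model drives in memory purely for illustration.

```python
from typing import List


class MediaError(Exception):
    """Raised when a simulated drive cannot read a block (a hard error)."""


class SimDrive:
    """Hypothetical in-memory stand-in for one member of a RAID 1 mirror."""

    def __init__(self, blocks: List[bytes]):
        self.blocks = list(blocks)
        self.bad = set()         # LBAs that currently return a hard error
        self.reassigned = set()  # LBAs that have been remapped to spare sectors

    def read(self, lba: int) -> bytes:
        if lba in self.bad:
            raise MediaError(lba)
        return self.blocks[lba]

    def reassign_and_write(self, lba: int, data: bytes) -> None:
        # Model reassignment: the LBA now maps to a good spare sector.
        self.bad.discard(lba)
        self.reassigned.add(lba)
        self.blocks[lba] = data


def scrub_mirror(a: SimDrive, b: SimDrive) -> List[int]:
    """Read every LBA on both members and repair hard errors from the peer.

    Returns the list of LBAs that were reassigned and rewritten.
    If both copies of an LBA are bad, the MediaError from the peer propagates.
    """
    repaired = []
    for lba in range(len(a.blocks)):
        for drive, peer in ((a, b), (b, a)):
            try:
                drive.read(lba)
            except MediaError:
                data = peer.read(lba)  # recreate lost data from redundancy
                drive.reassign_and_write(lba, data)
                repaired.append(lba)
    return repaired
```

In this sketch the repair happens entirely inside `scrub_mirror`, which mirrors the point that a RAID controller can restore redundancy without involving the OS or applications.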
When a hard drive is attached to a host via a “just a bunch of disks” (JBOD) host bus adapter, the adapter does not initiate this background data scrubbing activity. When JBOD drives are configured as RAID arrays where the RAID functionality is provided by the host CPU and the OS, rather than by using a RAID adapter, the background scrubbing functionality is usually not included. This is mainly because significant system resources would be consumed to perform background data scrubbing on all the hard drive resources attached to the system. Thus, in a system configuration where the OS provides RAID functionality (e.g., acts as a RAID controller), if a drive in the RAID array fails and a hard media error is then encountered during the rebuild process, the rebuild will fail because the array was already running exposed (i.e., with no redundancy). For example, this can occur when using the mirroring (RAID 1) function of the IBM Advanced Interactive eXecutive (AIX) operating system on an IBM eServer pSeries system. Further, some errors are not discovered during normal operation because a hard error may occur in an LBA containing infrequently used data. In such cases, a maintenance window has to be scheduled reasonably quickly so that the system can be brought down and a RAID 1 array can be recreated from backup tapes. Such issues are unacceptable in systems requiring high reliability.
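The failure mode described above, in which a latent hard error on the surviving drive defeats a rebuild, can be sketched as follows. This is an illustrative model only: `RebuildFailed` and `rebuild_mirror` are hypothetical names, and the surviving drive is represented simply as a list of blocks plus a set of unreadable LBAs.

```python
from typing import List, Set


class RebuildFailed(Exception):
    """Raised when the only remaining copy of a block cannot be read."""


def rebuild_mirror(survivor_blocks: List[bytes], bad_lbas: Set[int]) -> List[bytes]:
    """Copy every block from the surviving RAID 1 member to a replacement.

    The array is running exposed, so there is no second copy to fall back
    on: the first unreadable LBA on the survivor aborts the rebuild.
    """
    replacement = []
    for lba, data in enumerate(survivor_blocks):
        if lba in bad_lbas:
            raise RebuildFailed(f"hard media error at LBA {lba}")
        replacement.append(data)
    return replacement
```

Had the bad LBA been found and repaired by scrubbing while both members were still healthy, the set of unreadable LBAs would be empty at rebuild time and the copy would complete.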
The above-referenced related application describes mechanisms for reporting and recovering from uncorrectable data errors in a data processing system in which the hard drive is connected to the system using the Small Computer System Interface (SCSI) protocol. It would be desirable to provide a mechanism for reporting and recovering from uncorrectable data errors in a data processing system in which the hard drive is connected to the system using the Advanced Technology Attachment (ATA) or the Serial ATA (SATA) protocol.