1. Technical Field
The present invention relates to a system for storing and retrieving data, including a self-healing disk drive in a storage system.
2. Description of Related Art
In computer data storage networks, a host processor is typically connected to a storage subsystem. A storage subsystem may include a storage controller and a plurality of interconnected disk drives known as a Redundant Array of Inexpensive Disks (RAID), also referred to as RAID disk drive arrays or disk drive arrays. The storage controller may include one or more processors and one or more device adapters. The host processor may be a workstation or a server, such as a bank teller's computer or a computer operated by airline employees at an airport. The host processor directs the processors of the storage controller to instruct the adapters to write data to the disk drive arrays and to read the data from the disk drive arrays. For example, a data string is stored in the disk drive arrays, and a subsection of that data string is stored on a given disk drive that is a member of the array (hereinafter referred to as the first data string subsection of a first data string). The host processor may request that the storage controller of the storage subsystem read the first data string from the disk drive arrays. When the first data string is read from the disk drive arrays, the first data string may be incomplete because its first data string subsection is defective. The first data string is therefore temporarily lost. However, the data associated with the first data string subsection of the first data string can be reconstructed and recovered via parity information that is stored with the data string. Once the data associated with the first data string subsection is reconstructed, the host processor can receive a complete version of the first data string, and the reconstructed first data string subsection is stored in a new location on the given disk drive in the disk drive array.
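The parity-based reconstruction described above can be illustrated with a minimal sketch. It assumes single bytewise XOR parity of the kind commonly used in RAID 5; the text does not specify the parity scheme, and the names `xor_blocks` and `reconstruct` and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(subsections, parity, missing_index):
    """Rebuild the one missing subsection from the survivors and the parity.

    Assumption: exactly one subsection is missing and parity is plain XOR.
    """
    survivors = [s for i, s in enumerate(subsections) if i != missing_index]
    return xor_blocks(survivors + [parity])


# A data string split into three subsections, one per member drive.
stripe = [b"ACCT", b"0042", b"PAID"]
parity = xor_blocks(stripe)

# The first data string subsection becomes unreadable.
damaged = list(stripe)
damaged[0] = None

recovered = reconstruct(damaged, parity, 0)
print(recovered == stripe[0])
```

Because XOR is its own inverse, combining the parity with every surviving subsection cancels them out and leaves only the missing one, which can then be written to a new location on the drive.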
In conventional storage systems, nothing further is done beyond reconstructing and recovering the data associated with the first data string subsection of the first data string that was requested by the host processor. If a radial or a spiral scratch exists on the particular disk drive of the disk drive array, the scratch may damage a plurality of additional data string subsections of additional data strings. Because nothing further is done, the data associated with the additional data string subsections of the additional data strings are not reconstructed and recovered at the same time as the data associated with the first data string subsection of the first data string.
If the data associated with the additional data string subsections of the additional data strings are not reconstructed and recovered at the same time as the data associated with the first data string subsection of the first data string, and one of the additional data strings is subsequently read from the disk drive array, a single point of failure can occur in connection with that additional data string if two or more defective data string subsections exist in it.
A single point of failure can occur in at least two situations: (1) one of the RAID array disk drives no longer responds while another disk drive has a media defect, such as a media scratch leading to a hard read error in a data string subsection, or (2) two drives in the RAID array have defects located in their respective data string subsections for a given data string. In both situations, the RAID parity information is no longer sufficient to recover the missing data string subsections.
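The following sketch illustrates why parity is insufficient in both situations. It assumes single bytewise XOR parity, which the text does not specify; `xor_bytes`, `recoverable`, and the sample data are hypothetical. With two subsections missing, the parity reveals only their XOR, and many different pairs of values are consistent with it.

```python
def xor_bytes(a, b):
    """Return the bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def recoverable(subsections):
    """Single parity can rebuild at most one missing subsection (None)."""
    return sum(1 for s in subsections if s is None) <= 1


stripe = [b"AB", b"CD", b"EF"]
parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Situation (1) or (2): two members of the same data string are unreadable.
damaged = [None, None, stripe[2]]
print(recoverable(damaged))

# All that parity and the survivor reveal is the XOR of the two lost values.
residual = xor_bytes(parity, stripe[2])

# Several distinct candidate pairs match that residual equally well.
candidates = [(b"AB", b"CD"), (b"CD", b"AB"), (b"\x00\x00", residual)]
ambiguous = all(xor_bytes(x, y) == residual for x, y in candidates)
print(ambiguous)
```

Since every candidate pair satisfies the parity equation, the controller has no way to tell which pair was actually stored, so the data string cannot be rebuilt.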
Therefore, when the data associated with a first defective data string subsection on a particular disk drive of a disk drive array is located, reconstructed, and recovered, it is necessary to immediately inspect the areas adjacent to the first defective data string subsection on that disk drive in order to locate additional defects, such as an additional defective data string subsection of an additional data string, and to immediately reconstruct and recover the data associated with the additional defective data string subsection in addition to the data associated with the first defective data string subsection. This action is necessary in order to avoid the occurrence of a single point of failure when the additional data string is subsequently read from a RAID disk drive array of a storage subsystem.
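The inspection of adjacent areas can be sketched as follows. This is an illustrative model only: it assumes that "adjacent" can be approximated by a window of block numbers around the first defect, and the names `find_nearby_defects`, `is_readable`, and `radius` are hypothetical. An actual drive would need to map a radial or spiral scratch onto its physical track layout.

```python
def find_nearby_defects(is_readable, defect_block, radius, total_blocks):
    """Return unreadable blocks within `radius` of a known defective block.

    `is_readable` is a caller-supplied probe (hypothetical) that returns
    True when a test read of the given block succeeds.
    """
    lo = max(0, defect_block - radius)
    hi = min(total_blocks - 1, defect_block + radius)
    return [
        block
        for block in range(lo, hi + 1)
        if block != defect_block and not is_readable(block)
    ]


# Simulated drive: a scratch damaged blocks 100, 101, and 103;
# block 250 is an unrelated defect outside the inspected window.
bad_blocks = {100, 101, 103, 250}

nearby = find_nearby_defects(
    is_readable=lambda block: block not in bad_blocks,
    defect_block=100,
    radius=4,
    total_blocks=1000,
)
print(nearby)
```

Each block returned by the scan would then be reconstructed from parity right away, while the other members of its data string are still intact.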