1. Technical Field
The present invention relates to replacing failed storage devices. More particularly, the invention concerns using redundant spare storage devices to reduce the rebuild time when replacing a failed storage device in a storage device array.
2. Description of Related Art
Important data is frequently stored in storage devices, such as hard disk drives, used in computing systems. Consequently, it is desirable to reduce the probability of data being lost if a storage device fails.
Techniques that have been utilized to reduce the probability of data being lost when a storage device fails include storing parity information on another storage device, and making a duplicate copy of data on another storage device (data mirroring). If a storage device fails, parity information may be used to reconstruct the data that was on the failed storage device. If data mirroring is used, a duplicate copy of data that was on the failed storage device can be retrieved from another storage device.
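As an illustration of parity-based reconstruction, the following minimal sketch assumes simple byte-wise XOR parity, which is one common choice; the parity scheme, the function name xor_blocks, and the sample block contents are assumptions made for the example and are not specified above.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    # zip(*blocks) groups the i-th byte of every block together.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on a different storage device.
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]

# Parity block, imagined to live on a fourth storage device.
parity = xor_blocks(data)

# Simulate the failure of the device holding data[1]: XOR-ing the
# surviving data blocks with the parity block yields the lost block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Under this assumption, rebuilt equals data[1], because XOR-ing a value with itself cancels it out and leaves only the missing block.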
A Redundant Array of Inexpensive (or Independent) Disks (RAID) may be used to provide a data storage system that has increased performance and capacity. Data mirroring and parity information storage may be implemented on a RAID. Also, a technique called striping, in which data (and possibly parity information) is divided into blocks that are stored on different disks, may be used with a RAID to balance the load across the disks and to improve performance. Several RAID protocols have been devised wherein different mirroring, parity, and striping arrangements are employed. As an example, in RAID 5, data and parity information are striped across a number of disks. RAID 5 provides a redundancy of one, which means that data can be recovered after the failure of one storage device.
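The striping of data and parity in RAID 5 can be pictured with the following sketch. The rotation rule used here (parity moves one disk to the left on each successive stripe) is an assumption chosen for illustration, since actual placement varies between implementations; the function name raid5_layout and the labels "D0", "P", and so on are likewise hypothetical.

```python
def raid5_layout(num_disks, num_stripes):
    """Return a table of block labels, one row per stripe, one column per disk.

    Each stripe holds num_disks - 1 data blocks and one parity block ("P").
    The parity position rotates one disk to the left per stripe (assumed rule).
    """
    layout = []
    block = 0
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        layout.append(row)
    return layout


layout = raid5_layout(num_disks=4, num_stripes=3)
# layout[0] == ["D0", "D1", "D2", "P"]
# layout[1] == ["D3", "D4", "P", "D5"]
# layout[2] == ["D6", "P", "D7", "D8"]
```

Because every stripe keeps exactly one parity block and no disk holds more than one block of a given stripe, the loss of any single disk removes at most one block per stripe, which is consistent with a redundancy of one.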
For storage systems that have a redundancy of one, there is a period of time, referred to as a single point of failure time window, during which the data on the entire array can be lost if a second storage device fails. The single point of failure time window begins when a storage device in a storage array fails, and continues for the time required to reliably rebuild, on a spare storage device, the data that was stored on the failed storage device. In a similar but less extreme data loss scenario, a sector of data can be lost if any surviving storage device or the spare storage device suffers an unrecoverable read error during the rebuild time. For storage systems that have a redundancy of two, a single point of failure time window begins if two storage devices are simultaneously in a failed condition.
The probability of data being lost due to a subsequent storage device failure during a single point of failure time window is proportional to the time required for the rebuild. Accordingly, it is desirable to reduce the rebuild time. Generally, larger drives take longer to rebuild than smaller drives. The time required for a rebuild may be, for example, as long as several hours. Many of the techniques currently employed when rebuilding data on a spare disk prolong the rebuild time. For example, write verify operations extend the time required to complete a rebuild. Also, with known rebuild techniques, data is written to only a single spare disk, which can result in delay if there is an error while writing to the spare disk. Consequently, current rebuild techniques are not completely adequate.
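The relationship between rebuild time and the probability of a subsequent failure can be illustrated with a simple reliability model. The sketch below assumes that surviving devices fail independently at a constant rate equal to the reciprocal of a mean time between failures (MTBF); this model, the function name, and the example figures (7 surviving devices, a 1,000,000-hour MTBF) are assumptions for illustration only.

```python
import math


def second_failure_probability(surviving_devices, rebuild_hours, mtbf_hours):
    """Probability that at least one surviving device fails during the rebuild.

    Assumes independent devices with a constant failure rate of 1 / mtbf_hours.
    """
    combined_rate = surviving_devices / mtbf_hours
    # 1 - exp(-x), computed with expm1 for accuracy when x is very small.
    return -math.expm1(-combined_rate * rebuild_hours)


# Hypothetical figures: 7 surviving devices, MTBF of 1,000,000 hours.
p_short = second_failure_probability(7, rebuild_hours=4, mtbf_hours=1_000_000)
p_long = second_failure_probability(7, rebuild_hours=8, mtbf_hours=1_000_000)
ratio = p_long / p_short
```

Under these assumptions, when the rebuild time is short relative to the MTBF, the probability is approximately the number of surviving devices multiplied by the rebuild time and divided by the MTBF, so doubling the rebuild time very nearly doubles the probability of a second failure.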