Most modern mid-range to high-end disk storage subsystems are arranged as redundant arrays of independent disks (RAID). A number of RAID levels are known. RAID-1 includes sets of N data disks and N mirror disks for storing copies of the data disks. RAID-3 includes sets of N data disks and one parity disk. RAID-4 also includes sets of N+1 disks; however, data transfers are performed in multi-block operations. RAID-5 distributes parity data across all disk drives in each set of N+1 disk drives. At any level, it is desirable to have a RAID subsystem in which an input/output (I/O) operation can be performed with minimal operating system intervention.
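By way of illustration only, the distribution of parity across a set of N+1 disk drives in RAID-5 can be sketched as follows. The rotation rule shown (parity moving one drive to the left on each successive stripe) is an assumption made for the example; actual RAID-5 implementations may place parity differently. The function names `parity_disk` and `stripe_layout` are hypothetical.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Return the index of the drive that holds parity for a stripe.

    Assumption: parity starts on the last drive and moves one drive
    to the left on each successive stripe, wrapping around.
    """
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(stripe: int, num_disks: int) -> list[str]:
    """Label each drive in a stripe as data ("D") or parity ("P")."""
    p = parity_disk(stripe, num_disks)
    return ["P" if d == p else "D" for d in range(num_disks)]


if __name__ == "__main__":
    # Print the layout of the first five stripes of a 4-drive set (N = 3).
    for s in range(5):
        print(s, " ".join(stripe_layout(s, 4)))
```

Because no single drive holds all of the parity under such a rotation, parity updates are spread over every drive in the set rather than concentrated on one dedicated parity disk as in RAID-3 and RAID-4.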
One of the most important aspects of any RAID subsystem is its ability to withstand a disk drive failure. To implement this feature, the disk drives used by the RAID subsystem must have some amount of data duplicated. This data is the “redundant” data, and RAID levels 1, 10, 5 and 50 are some of the more popular RAID levels because of the redundancy provided. With redundant data, any one of the disk drives in the RAID array can fail without compromising data integrity. When a disk drive does fail, the RAID subsystem uses the redundant data to reconstruct all of the data originally stored on the array. While the RAID subsystem is performing this failure recovery, the RAID array is operating in a “degraded” state. For most RAID levels, a second disk drive failure during this time could result in some data loss for the user.
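For parity-based RAID levels, the reconstruction described above is commonly performed with a bytewise exclusive-OR (XOR). The following is a minimal sketch under that assumption, operating on in-memory byte strings rather than real disk blocks; the function names `xor_blocks`, `compute_parity` and `reconstruct` are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block for one stripe: the XOR of all of its data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks):
    """Recover the single missing block of a stripe.

    surviving_blocks holds every remaining block of the stripe,
    including the parity block if it survived.
    """
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = compute_parity(data)
    # Simulate the loss of the second drive and rebuild its block.
    rebuilt = reconstruct([data[0], data[2], parity])
    print(rebuilt == data[1])
```

The sketch recovers exactly one missing block per stripe. If a second block of the same stripe is lost before the rebuild completes, the XOR of the survivors no longer determines either missing block, which is why a degraded array is exposed to data loss.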
When a RAID subsystem is operating in a degraded state, the risk of losing data is much greater. Therefore, RAID subsystems attempt to minimize the time that the array operates in the degraded state. When a new disk drive is added to an array, the RAID subsystem regenerates redundant data in a process known as “rebuilding the array.” The rebuild process can take several hours to complete. If user intervention is required to start the rebuild process, rebuilding may not complete until several days have passed. Having a RAID array in the degraded state for several days puts the integrity of the data at great risk.
To work around the problem of requiring user intervention, most RAID subsystems use what are called “hot spare” disk drives. With hot spare disk drives, an extra disk drive is set aside in “stand-by mode” to allow the rebuild process to start the instant a disk drive failure is detected.
However, a hot spare is an attached disk drive that does not get used except in the event of a disk drive failure. This is a waste of a disk drive that could otherwise be used to increase performance while the array is not operating in the degraded state.
Another way to allow the immediate start of a rebuild operation is to change the RAID level of the array to one that has less redundancy and, therefore, uses fewer disk drives. While this is useful, after the rebuild completes it leaves the array with less redundancy than the user originally wanted. See, for example, U.S. Pat. No. 5,479,653 issued to Jones on Dec. 26, 1995, “Disk array apparatus and method which supports compound raid configurations and spareless hot sparing.”
Therefore, there is a need for a RAID subsystem that can rebuild the array to an equivalent level of redundancy without requiring a spare standby disk drive. In addition, it is desired that the subsystem can tolerate multiple failures.