The extensive data storage needs of modern computer systems require large-capacity mass data storage devices. A common storage device is the magnetic disk drive, a complex piece of machinery containing many parts that are susceptible to failure. A typical computer system will contain several such units. The failure of a single storage unit can be a very disruptive event for the system. Many systems are unable to operate until the defective unit is repaired or replaced, and the lost data restored.
As computer systems have become larger, faster, and more reliable, there has been a corresponding increase in the capacity, speed and reliability demanded of their storage devices. Simply adding storage units to increase storage capacity causes a corresponding increase in the probability that at least one unit will fail. On the other hand, increasing the size of existing units, absent any other improvements, tends to reduce speed and does nothing to improve reliability.
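The effect of unit count on failure probability can be illustrated with a short calculation. This is only a sketch: it assumes that units fail independently and with the same probability, and the function name and the value 0.03 are hypothetical, chosen for illustration rather than taken from any measured data.

```python
def prob_any_failure(p_unit, num_units):
    """Probability that at least one of num_units units fails.

    Assumes independent failures with identical per-unit probability p_unit.
    """
    return 1.0 - (1.0 - p_unit) ** num_units


# Assumed per-unit failure probability over some fixed period (illustrative only).
p_unit = 0.03

# Risk of at least one failure for arrays of increasing size.
risk_by_size = {n: prob_any_failure(p_unit, n) for n in (1, 2, 4, 8)}
```

Under these assumptions the risk grows with every unit added, which is why capacity gained by adding units must be paired with some form of redundancy.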
Recently there has been considerable interest in arrays of direct access storage devices, configured to provide some level of data redundancy. Such arrays are commonly known as "RAIDs" (Redundant Arrays of Inexpensive Disks). Various types of RAIDs providing different forms of redundancy are described in a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)", by Patterson, Gibson and Katz, presented at the ACM SIGMOD Conference, June, 1988. Patterson, et al., classify five types of RAIDs, designated levels 1 through 5. The Patterson nomenclature has become standard in the industry. The underlying theory of RAIDs is that a large number of relatively small disk drives, some of which are redundant, can simultaneously provide increased capacity, speed and reliability.
Using the Patterson nomenclature, RAID levels 3 through 5 (RAID-3, RAID-4, RAID-5) employ parity records for data redundancy. Parity records are formed from the Exclusive-OR of all data records stored at a particular location on different storage units in the array. In other words, in an array of N storage units, each bit in a block of data at a particular location on a storage unit is Exclusive-ORed with the corresponding bit of every other block at that location in a group of (N-1) storage units to produce a block of parity bits; the parity block is then stored at the same location on the remaining storage unit. If any storage unit in the array fails, the data contained at any location on the failing unit can be regenerated by taking the Exclusive-OR of the data blocks at the same location on the remaining devices and their corresponding parity block.
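The parity and regeneration steps described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular controller: the helper name xor_blocks, the four-unit array and the byte values are all assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise Exclusive-OR of any number of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical array of N = 4 units: three data blocks at one location,
# with the parity block stored at the same location on the fourth unit.
data_blocks = [b"\x0f\xa0\x33", b"\xf0\x0a\x55", b"\x3c\xc3\x99"]
parity_block = xor_blocks(data_blocks)

# Simulate the failure of the unit holding data_blocks[1].
failed_index = 1
survivors = [blk for i, blk in enumerate(data_blocks) if i != failed_index]

# Regenerate the lost block from the surviving data blocks and the parity block.
regenerated = xor_blocks(survivors + [parity_block])
```

Because Exclusive-OR is its own inverse, the same routine serves both to build the parity block and to regenerate a lost block.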
RAID-4 and RAID-5 are further characterized by independently operating read/write actuators in the storage units. In other words, each read/write head of a disk drive unit is free to access data anywhere on the disk, without regard to where other units in the array are accessing data. U.S. Pat. No. 4,761,785 to Clark et al., which is hereby incorporated by reference, describes a type of independent read/write array in which the parity blocks are distributed substantially equally among the storage units in the array. Distributing the parity blocks shares the burden of updating parity among the disks in the array on a more or less equal basis, thus avoiding potential performance bottlenecks that may arise when all parity records are maintained on a single dedicated disk drive unit. Patterson et al. have designated the Clark array RAID-5. RAID-5 is the most advanced level RAID described by Patterson, offering improved performance over other parity protected RAIDs.
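The benefit of distributing parity can be seen by counting how many parity blocks each unit holds. The sketch below assumes a simple rotation in which the parity for each successive group of blocks moves to the next unit; this placement rule, the function name parity_unit and the array sizes are illustrative assumptions and are not taken from Clark et al.

```python
from collections import Counter


def parity_unit(stripe, num_units):
    """Unit holding the parity block for a given stripe.

    Simple rotating placement, assumed here purely for illustration.
    """
    return stripe % num_units


num_units = 4
num_stripes = 12

# Distributed parity: parity blocks are spread across all units.
distributed_load = Counter(parity_unit(s, num_units) for s in range(num_stripes))

# Dedicated parity: every parity block lands on one unit (here the last).
dedicated_load = Counter(num_units - 1 for _ in range(num_stripes))
```

With the rotation, each of the four units holds three of the twelve parity blocks, whereas the dedicated arrangement places all twelve on a single unit, which must then take part in every write.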
One of the problems encountered with parity protected disk arrays having independent read/write access (i.e., RAID-4 or RAID-5) is the overhead associated with updating the parity block whenever a data block is written. Typically, as described in Clark, et al., the old contents of the data block to be written are first read and Exclusive-ORed with the new data to produce a change mask. The parity block is then read and Exclusive-ORed with the change mask to produce the new parity data. The data and parity blocks can then be written. Thus, two read and two write operations are required each time data is updated.
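The update sequence just described, with its two reads and two writes, might be sketched as follows. The list named units is a hypothetical in-memory stand-in for the blocks at one location across the array, and the function names and byte values are assumptions made for the example.

```python
def xor_bytes(a, b):
    """Bytewise Exclusive-OR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_block(units, data_unit, parity_unit, new_data):
    """Write new_data to data_unit and bring the parity block up to date."""
    old_data = units[data_unit]                      # read 1: old data
    change_mask = xor_bytes(old_data, new_data)
    old_parity = units[parity_unit]                  # read 2: old parity
    new_parity = xor_bytes(old_parity, change_mask)
    units[data_unit] = new_data                      # write 1: new data
    units[parity_unit] = new_parity                  # write 2: new parity
    return change_mask


# Illustrative values: three data blocks and their parity block (unit 3).
units = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\x1b\x29"]
update_block(units, data_unit=1, parity_unit=3, new_data=b"\xff\x00")

# Cross-check: the updated parity equals the XOR of all current data blocks.
recomputed = xor_bytes(xor_bytes(units[0], units[1]), units[2])
```

Note that only the unit being written and the unit holding parity are accessed; the other data units are not read, which is what allows independent access to proceed elsewhere in the array.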
In a typical computer system, the central processing unit (CPU) operates much faster than the storage devices. Completing the two read and two write operations needed to update data and parity requires a comparatively long period of time in relation to CPU operations. If the CPU holds off further processing of a task until the data update in the storage devices is completed, system performance can be adversely affected. It is desirable to permit the CPU to proceed with processing a task immediately or shortly after transmitting data to the disk array for writing, while still maintaining data redundancy.
A single parity block of a RAID-3, RAID-4 or RAID-5 provides only one level of data redundancy. This ensures that data can be recovered in the event of failure of a single storage unit. However, the system must be designed to either discontinue operations in the event of a single storage unit failure, or continue operations without data redundancy. If the system is designed to continue operations, and a second unit fails before the first unit is repaired or replaced and its data reconstructed, catastrophic data loss may occur. In order to support a system that remains operational at all times, and reduces the possibility of such catastrophic data loss, it is possible to provide additional standby storage units, known as "hot spares". Such units are physically connected to the system, but do not operate until a unit fails. In that event, the data on the failing unit is reconstructed and placed on the hot spare, and the hot spare assumes the role of the failing unit. Although the hot spares technique enables a system to remain operational and maintain data redundancy in the event of a device failure, it requires additional storage units (and attendant cost) which otherwise serve no useful function.
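Reconstruction onto a hot spare amounts to repeating the Exclusive-OR regeneration at every location of the failed unit. The following is a minimal sketch under assumed conditions: each unit is modelled as a list of equal-length blocks, and the names xor_blocks and rebuild_unit, the dedicated parity unit and the data values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise Exclusive-OR of any number of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_unit(surviving_units):
    """Regenerate every block of a failed unit from the surviving units.

    Each unit is a list of blocks, one per location; the result is the
    content to be placed on the hot spare.
    """
    return [xor_blocks(blocks_at_location)
            for blocks_at_location in zip(*surviving_units)]


# Three data units with two locations each (illustrative values),
# plus a parity unit computed location by location.
unit0 = [b"\x11\x22", b"\x33\x44"]
unit1 = [b"\x55\x66", b"\x77\x88"]
unit2 = [b"\x99\xaa", b"\xbb\xcc"]
parity = rebuild_unit([unit0, unit1, unit2])

# Unit 0 fails; its contents are reconstructed for the hot spare.
hot_spare = rebuild_unit([unit1, unit2, parity])
```

The same routine works wherever the parity blocks reside, since the Exclusive-OR of all blocks at a location, parity included, is zero.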