The present invention relates to maintaining parity information on computer data storage devices and in particular to maintaining availability of a computer system when reconstructing data from a failed storage device.
The extensive data storage needs of modern computer systems require large capacity mass data storage devices. A common storage device is the magnetic disk drive, a complex piece of machinery containing many parts which are susceptible to failure. A typical computer system will contain several such units. As users increase their need for data storage, systems are configured with larger numbers of storage units. The failure of a single storage unit can be a very disruptive event for the system. Many systems are unable to operate until the defective unit is repaired or replaced, and the lost data restored. An increased number of storage units increases the probability that any one unit will fail, leading to system failure. At the same time, computer users are relying more and more on the consistent availability of their systems. It therefore becomes essential to find improved methods of reconstructing data contained on a failing storage unit, and sustaining system operations in the presence of a storage unit failure.
One method of addressing these problems is known as "mirroring". This method involves maintaining a duplicate set of storage devices, which contains the same data as the original. The duplicate set is available to assume the task of providing data to the system should any unit in the original set fail. Although very effective, this is a very expensive method of resolving the problem since a customer must pay for twice as many storage devices.
A less expensive alternative is the use of parity blocks. Parity blocks are records formed from the Exclusive-OR of all data records stored at a particular location on different storage units. In other words, each bit in a block of data at a particular location on a storage unit is Exclusive-ORed with every other bit at that same location in each storage unit in a group of units to produce a block of parity bits; the parity block is then stored at the same location on another storage unit. If any storage unit in the group fails, the data contained at any location on the failing unit can be regenerated by taking the Exclusive-OR of the data blocks at the same location on the remaining devices and their corresponding parity block.
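The parity scheme described above can be illustrated with a minimal sketch. It assumes each storage unit is modeled as a byte string of equal length holding the block at one location; the function name `xor_blocks` and the sample data are hypothetical and are used only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise Exclusive-OR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields, for each byte position, the bytes found at that
    # position on every unit; XOR-ing them gives one byte of the result.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks stored at the same location on three storage units
# (illustrative values only).
data = [b"\x0f\xf0\xaa", b"\x33\x55\x01", b"\xff\x00\x10"]

# The parity block would be stored at the same location on a fourth unit.
parity = xor_blocks(data)

# Suppose the unit holding data[1] fails: its block is regenerated from the
# surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because Exclusive-OR is its own inverse, the same routine serves both to generate the parity block and to regenerate the block of any single failed unit.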
U.S. Pat. No. 4,092,732 to Ouchi describes a parity block method. In the Ouchi device, a single storage unit is used to store parity information for a group of storage devices. A read and a write on the storage unit containing parity blocks occurs each time a record is changed on any of the storage units in the group covered by the parity record. Thus, the storage unit with the parity records becomes a bottleneck to storage operations. U.S. Pat. No. 4,761,785 to Clark et al., which is hereby incorporated by reference, improves upon storage of parity information by distributing parity blocks substantially equally among a set of storage units. N storage units in a set are divided into a multiple of equally sized address blocks, each containing a plurality of records. Blocks from each storage unit having the same address ranges form a stripe of blocks. Each stripe has a block on one storage device containing parity for the remaining blocks of the stripe. The parity blocks for different stripes are distributed among the different storage units in a round robin manner.
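The difference between a dedicated parity unit and round robin distribution can be sketched as follows. The modulo mapping in `round_robin_parity_unit` is an assumption made for illustration; the text above does not state the exact mapping used in the Clark patent, and all function names here are hypothetical.

```python
from collections import Counter


def dedicated_parity_unit(stripe, n_units):
    """Single parity unit arrangement: the last unit holds parity for every stripe."""
    return n_units - 1


def round_robin_parity_unit(stripe, n_units):
    """Assumed round robin rule: the parity block moves to the next unit with each stripe."""
    return stripe % n_units


def stripe_layout(stripe, n_units, parity_unit_fn):
    """Return a list marking each unit's block in the stripe as 'P' (parity) or 'D' (data)."""
    p = parity_unit_fn(stripe, n_units)
    return ["P" if unit == p else "D" for unit in range(n_units)]


def parity_load(n_stripes, n_units, parity_unit_fn):
    """Count how many stripes place their parity block on each unit."""
    return Counter(parity_unit_fn(s, n_units) for s in range(n_stripes))


if __name__ == "__main__":
    n_units, n_stripes = 4, 8
    for s in range(n_stripes):
        print(s, " ".join(stripe_layout(s, n_units, round_robin_parity_unit)))
    print("dedicated:  ", dict(parity_load(n_stripes, n_units, dedicated_parity_unit)))
    print("round robin:", dict(parity_load(n_stripes, n_units, round_robin_parity_unit)))
```

With a dedicated unit, every stripe's parity update lands on the same device, which is the bottleneck noted above; under the assumed round robin rule the parity blocks, and hence the parity updates, are spread evenly across the set.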
The use of parity records as described in the Ouchi and Clark patents substantially reduces the cost of protecting data when compared to mirroring. However, while Ouchi and Clark teach a data recovery or protection means, they do not provide a means to keep a system operational to a user during data reconstruction. Normal operations are interrupted while a memory controller is powered down to permit a repair or replacement of the failed storage device, followed by a reconstruction of the data. Since this prior art relies exclusively on software for data reconstruction, the system can be disabled for a considerable time.
Prior art does not teach dynamic system recovery and continued operation without the use of duplicate or standby storage units. Mirroring requires a doubling of the number of storage units. A less extreme approach is the use of one or more standby units, i.e., additional spare disk drives which can be brought on line in the event any unit in the original set fails. Although this does not entail the cost of a fully mirrored system, it still requires additional storage units which otherwise serve no useful function.
It is therefore an object of the present invention to provide an enhanced method and apparatus for recovering from data loss in a computer system having multiple data storage units.
It is a further object of this invention to provide an enhanced method and apparatus whereby a computer system having multiple data storage units may continue to operate if one of the data storage units fails.
Another object of this invention is to reduce the cost of protecting data in a data processing system having multiple protected storage units.
A still further object of this invention is to increase the performance of a computer system having multiple data storage units when one of the data storage units fails and the system must reconstruct the data contained on the failed unit.