The present invention relates to redundant arrays of independent disks (RAID), and in particular to systems for recovery from a failure during a write operation.
RAID systems allow data to be written in stripes across multiple disk drives. When the data is written, parity data is written on the last disk drive, or in some RAID systems on two additional disk drives. The parity data is a combination of the data on the other disks, formed, for example, by exclusive-ORing that data together. By storing this parity data, error recovery can be accomplished if one of the data drives fails or is removed. Using the parity data and the remaining data on the other disk drives, the data from the failing disk drive can be regenerated.
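As a minimal sketch of the exclusive-OR parity described above, the following Python fragment builds a parity block for one stripe and then regenerates the block of a failed drive from the surviving blocks and the parity. The three-drive layout, the block contents, and the helper name xor_blocks are illustrative assumptions, not details of any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks belonging to one stripe.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate the loss of drive 1: XOR the surviving data blocks with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the parity with every surviving block cancels their contributions and leaves the missing block.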
When new data is written to only one or a few of the disks in a stripe of disk drives, a Read-Modify-Write operation is necessary. The read operation reads the data and the parity from the disk drives in the stripe. The modify operation overwrites the data for one or more of the disks with the new data and calculates a new parity. The write operation then writes the new data and the newly calculated parity back onto the disk drives. In the event of a failure during this operation, it is necessary to take appropriate steps to avoid the loss of data.
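The three phases of a Read-Modify-Write can be sketched as follows, with an in-memory list standing in for the disk drives. The function name read_modify_write, the stripe layout (parity on the last drive), and the block values are assumptions made for illustration. The final lines also check a standard XOR identity, namely that the new parity equals the old parity XOR the old data XOR the new data; whether a given controller computes parity this way is likewise an assumption.

```python
def xor_blocks(*blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def read_modify_write(disks, parity_idx, updates):
    """Apply `updates` ({drive index: new block}) to one stripe, in place."""
    # Read: fetch the current data and parity blocks of the stripe.
    stripe = list(disks)

    # Modify: overlay the new data, then calculate the new parity.
    for idx, block in updates.items():
        stripe[idx] = block
    data_blocks = [b for i, b in enumerate(stripe) if i != parity_idx]
    stripe[parity_idx] = xor_blocks(*data_blocks)

    # Write: store only the changed data blocks and the new parity.
    for idx in list(updates) + [parity_idx]:
        disks[idx] = stripe[idx]


# Hypothetical stripe: two data drives followed by one parity drive.
disks = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
old_data, old_parity = disks[0], disks[2]
new_data = b"\xff\x00"

read_modify_write(disks, parity_idx=2, updates={0: new_data})

# Same parity obtained from the old parity, old data, and new data alone.
delta_parity = xor_blocks(old_parity, old_data, new_data)
```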
One step to avoid conflicts is to lock out the span of disk drives being read, so that another operation does not attempt to read or modify the data in that span while the Read-Modify-Write operation is proceeding. In addition, a cache is typically used to store the old and new data so that upon a failure during the operation, the data is still available for performing the operation. For example, the old data can be maintained in cache while the new data is being written to ensure that it is available in case of a failure during a write.
In a typical operation, both the data and the parity are written in parallel to the multiple disk drives. Until the write is completed, the RAID controller cannot tell the host system that the operation has been completed. In order to improve speed, some systems use what is called an "early commit." This means telling the host system that the write has been completed, before it has actually been completed. Such an early commit is possible only if the appropriate data has been saved in cache or otherwise to allow completion of the operation in the event of a failure during the write operation. There are a number of reasons for using an early commit. For example, another operation may be using the parity drive ahead of this particular Read-Modify-Write, thus causing further delay in the actual write, which is avoided by an early commit.
It would be desirable to minimize use of non-volatile cache for an extended period of time to save data for error recovery. It would also be desirable to minimize the amount of time before the write controller can tell the host that the write has been completed. In addition, it would be desirable to minimize the amount of time a span of disk drives needs to be locked off from other operations.
The present invention improves upon the prior art Read-Modify-Write operation to achieve the above desirable objectives. This is done by separating the writing of new data and the writing of new parity, so that they are not done in parallel. This allows recovery to be performed at all stages without requiring the excessive use of non-volatile cache, while minimizing the amount of time a span has to be locked. The invention accomplishes this by allowing one of the recoveries to be a recovery to the old data, as it was before the write, with a signal to the host that the write operation failed in such a recovery situation, requiring the host to resend the data. This speeds up the entire operation and minimizes the use of resources; the only cost is that, in the rare instance of a failure during a particular part of the Read-Modify-Write, the host needs to resend the data.
In one embodiment, the new data is written before the new parity is written. In the event of a failure during the writing of the new data, the recovery is the writing of old data. In the event of a failure during the writing of the parity, the recovery can write the new parity again. Thus, only a failure during the writing of the new data requires a recovery with a write failure sent back to the host.
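The data-first embodiment above can be sketched as a two-stage update with a different recovery for each stage. This is an illustrative simulation, not the claimed implementation: the names staged_update, WriteFailed, and the fail_during parameter are hypothetical, failures are injected by hand, and the sketch assumes the new parity is still available to the recovery path when the parity write fails.

```python
class WriteFailed(Exception):
    """Hypothetical signal telling the host that it must resend its data."""


def xor_blocks(*blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def staged_update(disks, data_idx, parity_idx, new_data, fail_during=None):
    """Write new data, then new parity, with a per-stage recovery."""
    old_data = disks[data_idx]
    old_parity = disks[parity_idx]
    new_parity = xor_blocks(old_parity, old_data, new_data)

    # Stage 1: write the new data first.
    if fail_during == "data":
        disks[data_idx] = b"\x00" * len(old_data)  # simulated torn write
        disks[data_idx] = old_data                 # recovery: restore old data
        raise WriteFailed("data write failed; host must resend")
    disks[data_idx] = new_data

    # Stage 2: write the new parity only after the data is in place.
    if fail_during == "parity":
        disks[parity_idx] = b"\x00" * len(old_parity)  # simulated torn write
        disks[parity_idx] = new_parity                 # recovery: write parity again
        return "recovered"
    disks[parity_idx] = new_parity
    return "ok"


# Hypothetical stripe: two data drives followed by one parity drive.
original = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
disks = list(original)

# A failure while writing data rolls back and reports failure to the host.
host_must_resend = False
try:
    staged_update(disks, 0, 2, b"\xff\x00", fail_during="data")
except WriteFailed:
    host_must_resend = True
after_data_failure = list(disks)

# A failure while writing parity is repaired without involving the host.
result = staged_update(disks, 0, 2, b"\xff\x00", fail_during="parity")
```

In this sketch, only the first stage can end in a failure reported to the host; once the new data is on disk, the stripe can always be brought forward to a consistent state.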
An alternate embodiment can have the new parity written before the writing of new data.
In addition, rather than locking the span starting from the reading of the old data and old parity, the present invention only requires the span to be locked after the new parity is calculated (and before the new data and new parity are written) until after the writing of the new parity and new data.
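The narrower lock window can be illustrated with a small sketch in which the read and the parity calculation happen outside the lock, and only the two writes happen inside it. The span_lock object, the events log, and the function name update_with_late_lock are hypothetical; how a controller would guard against another operation changing the span between the read and the lock is outside the scope of this sketch.

```python
import threading

span_lock = threading.Lock()  # hypothetical lock guarding one span of drives
events = []                   # records the order of steps, for illustration only


def update_with_late_lock(disks, data_idx, parity_idx, new_data):
    """Lock the span only around the writes, not the read or the calculation."""
    # Read old data and old parity without holding the lock.
    old_data = disks[data_idx]
    old_parity = disks[parity_idx]
    events.append("read")

    # Calculate the new parity, still without the lock.
    new_parity = bytes(
        p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data)
    )
    events.append("parity calculated")

    # Lock the span only for the duration of the two writes.
    with span_lock:
        events.append("locked")
        disks[data_idx] = new_data
        disks[parity_idx] = new_parity
        events.append("written")
    events.append("unlocked")


# Hypothetical stripe: two data drives followed by one parity drive.
disks = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
update_with_late_lock(disks, data_idx=0, parity_idx=2, new_data=b"\xff\x00")
```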
For a further understanding of the nature and advantages of the invention, reference should be made to the following description taken in conjunction with the accompanying drawings.