A. Technical Field
The present invention relates generally to disk array storage systems, and more particularly, to a method for power failure recovery of pending write operations to a disk array storage system.
B. Background of the Invention
Disk arrays comprising a multiplicity of small inexpensive disk drives connected in parallel have emerged as a low cost alternative to the use of single large disks for non-volatile storage of information. The disk array appears as a single large virtual disk drive to a host system and offers improvements in performance, reliability, power consumption and scalability over a single large magnetic disk. In addition to data, redundancy information is stored within the array so that if any single disk, or portion thereof, within the array should fail, the disk array continues to function without the loss of data. An example of such a disk array is a Redundant Array of Independent Disks (“RAID”).
The manner in which data is stored on a RAID depends on the disk array arrangement. There are several such arrangements, referred to as RAID levels. A RAID level 1 system comprises one or more disks for storing data and an equal number of additional “mirror” disks for storing copies of the information written to the data disks. The other RAID levels, identified as RAID level 2, 3, 4 and 5 systems, segment the data into portions for storage across several data disks. Error check or parity information is stored either on one or more dedicated additional disks or distributed among the data.
FIG. 1 illustrates an exemplary RAID system according to one embodiment of the invention. The RAID system includes a RAID controller 110 that functions as an interface between a host or operating system and an array of disks 120. The array of disks contains a first drive 130, a second drive, and up to an Nth drive 150 on which data is stored. The RAID controller controls the read and write operations of this data into the RAID 120.
In RAID implementations, a write to disk involves not only writing to data strips but also generating parity and writing it to a parity strip. The data, in the form of blocks, is spread among N disks, and the parity (or redundancy) information generated from the data can be stored on a dedicated disk or spread across all the disks, as in a RAID 5 implementation.
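The two parity placements described above can be illustrated with a minimal sketch. The function name `stripe_layout`, the role labels, and the particular rotation formula are assumptions made for illustration only; actual controllers may rotate parity differently.

```python
def stripe_layout(stripe, num_disks, rotate_parity=True):
    """Return the role of each disk for one strip set: 'P' or 'D<k>'.

    rotate_parity=True  -> parity moves to a different disk per strip set
                           (RAID 5 style; the rotation rule is hypothetical).
    rotate_parity=False -> parity always resides on the last disk
                           (dedicated parity disk).
    """
    if rotate_parity:
        parity_disk = (num_disks - 1 - stripe) % num_disks
    else:
        parity_disk = num_disks - 1

    roles = []
    data_index = 0
    for disk in range(num_disks):
        if disk == parity_disk:
            roles.append("P")
        else:
            roles.append(f"D{data_index}")
            data_index += 1
    return roles


# Print the first four strip sets of a four-disk array with rotated parity.
for s in range(4):
    print(s, stripe_layout(s, 4))
```

With rotation enabled, the parity role occupies a different disk in each successive strip set, whereas with rotation disabled every strip set places parity on the same disk.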
FIG. 2 illustrates an exemplary fault-tolerant RAID 5 in which data and parity are striped across three or more physical disks according to one embodiment of the invention. As shown, a RAID controller 210 issues multiple write operations 211 to multiple drives 220, 230, 240, 250 in order to store data and provide redundancy thereof. For example, data may be stored in data files or stripes across multiple drives. In this particular example, data is written to a first file 221 on the first drive 220; a second file 231 on the second drive 230; a third file 241 on the third drive 240; and a fourth file 251 on the Nth drive 250. A parity stripe or file 222 is written on the first drive 220.
If a partial or complete failure of a physical disk occurs, the data that was lost or corrupted may be re-created from the remaining data and parity. A single parity strip is calculated for each strip set. The parity strip of each strip set is stored on a different disk, so that there is no single parity-only drive; among other reasons, a dedicated parity drive would represent a bottleneck. Striping the parity as well as the data in this manner aids the re-creation of data when a drive fails.
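The re-creation of lost data from the remaining data and parity can be sketched as follows, assuming that parity is the bytewise exclusive-OR (XOR) of the data strips in a strip set, as is conventional for RAID 5. The helper name `xor_strips` and the sample byte values are illustrative assumptions.

```python
from functools import reduce


def xor_strips(strips):
    """Bytewise XOR of a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Three data strips of one strip set (sample values only).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]

# The parity strip is computed over all data strips of the strip set.
parity = xor_strips(data)

# Suppose the disk holding data[1] fails. XOR of the surviving data
# strips and the parity strip regenerates the lost strip.
rebuilt = xor_strips([data[0], data[2], parity])
print(rebuilt == data[1])
```

The regeneration is correct only if the parity strip is consistent with the data strips at the time of the failure.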
Data inconsistencies may arise between data strips and parity in the event of a failure, such as a power failure within the disk array. In particular, data corruption can occur if data and parity become inconsistent due to the array failure, resulting in a false regeneration when data from the failed member disk is subsequently requested by an application. When writes are issued for data and parity (or mirror) strips, each write is typically performed as an independent write to a different disk, with no correlation or synchronization therebetween. A power failure occurring between the two writes leads to inconsistency between the data and the parity (or mirror).
FIG. 3 illustrates an exemplary write operation to the disk array that was interrupted by a power failure. In response to a write command, data D1 320 is written to a disk at time T1 and corresponding parity information P 330 is updated on the disk at time T2. If a power failure 310 occurs between time T1 and time T2, the data information written to the disk is updated but parity remains unchanged thereby creating an inconsistency between the data and the parity information associated therewith.
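The interrupted write described above, and the false regeneration that may follow, can be sketched as shown below. The sketch assumes bytewise XOR parity over two data strips; the variable names and byte values are illustrative and are not taken from the figures.

```python
from functools import reduce


def xor_strips(strips):
    """Bytewise XOR of a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Consistent state before the write command.
d1_old, d2 = b"\x0a", b"\x05"
parity = xor_strips([d1_old, d2])

# Time T1: the new data strip reaches the disk.
d1_on_disk = b"\x0c"

# Power fails before time T2, so the parity strip is never updated.
is_consistent = xor_strips([d1_on_disk, d2]) == parity

# If the disk holding d2 fails later, regeneration uses the stale
# parity and silently returns wrong data.
regenerated_d2 = xor_strips([d1_on_disk, parity])
print(is_consistent, regenerated_d2 == d2)
```

Both comparisons evaluate to false: the strip set is inconsistent, and the strip regenerated from the stale parity differs from the data originally stored.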
Generally, power failure recovery systems keep track of writes to a logical drive by setting a flag in a non-volatile memory while the disk write operation is executing, so that if a power failure occurs between the individual writes to the disks, the processor can be signaled that writes are incomplete. The procedure further involves recomputing parity to make it consistent with the data by running a consistency check on the complete drives, since no information identifying the disk for which the write was intended is present.
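The flag-based recovery described above can be sketched as follows. This is a simplified model under stated assumptions, not an actual controller implementation: the class `LogicalDrive`, the attribute `nvram_write_pending`, and the `PowerFailure` exception are hypothetical names, parity is assumed to be bytewise XOR, and the power failure is simulated by raising an exception between the data write and the parity write.

```python
from functools import reduce


def xor_strips(strips):
    """Bytewise XOR of a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


class PowerFailure(Exception):
    """Stands in for an abrupt loss of power (illustrative only)."""


class LogicalDrive:
    def __init__(self, num_stripes, data_disks, strip_len=1):
        zero = bytes(strip_len)
        self.data = [[zero] * data_disks for _ in range(num_stripes)]
        self.parity = [zero] * num_stripes
        # Models a flag in non-volatile memory; it survives the failure.
        self.nvram_write_pending = False

    def write(self, stripe, disk, payload, fail_before_parity=False):
        self.nvram_write_pending = True       # set before touching the disks
        self.data[stripe][disk] = payload     # data write
        if fail_before_parity:
            raise PowerFailure                # parity write never happens
        self.parity[stripe] = xor_strips(self.data[stripe])  # parity write
        self.nvram_write_pending = False      # both writes completed

    def recover(self):
        """Check every stripe, since the flag does not say which one was hit."""
        if not self.nvram_write_pending:
            return 0
        repaired = 0
        for s, strips in enumerate(self.data):
            expected = xor_strips(strips)
            if self.parity[s] != expected:
                self.parity[s] = expected
                repaired += 1
        self.nvram_write_pending = False
        return repaired


drive = LogicalDrive(num_stripes=4, data_disks=3)
drive.write(0, 0, b"\x07")
try:
    drive.write(2, 1, b"\x09", fail_before_parity=True)
except PowerFailure:
    pass
flag_after_failure = drive.nvram_write_pending
repaired = drive.recover()
print(flag_after_failure, repaired)
```

Because the single flag carries no information about the interrupted stripe, `recover` must examine all stripes of the logical drive even though only one is inconsistent.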
Other methods of power failure recovery mark a logical drive as inconsistent at boot time if writes to it did not complete before the power failure, which reduces the parity consistency computation that would otherwise cover the complete drives. The identified logical drive is then checked for consistency, and the parity is recomputed for any stripe found to be inconsistent. However, in a degraded RAID 5 scenario in which both data and parity must be updated, a power failure recovery system as described above is inapplicable: because of the degraded disk, the remaining data is itself inconsistent, and there is insufficient information to reconstruct the data.
Therefore, a method for safeguarding disk array write operations is required to compensate for inconsistencies between data and parity caused by a power failure prior to completion of a write procedure.