1. Field of the Invention
The present invention is related to the field of error correction techniques for an array of disks.
2. Background Art
A computer system typically requires large amounts of secondary memory, such as a disk drive, to store information (e.g. data and/or application programs). Prior art computer systems often use a single "Winchester" style hard disk drive to provide permanent storage of large amounts of data. As the performance of computers and associated processors has increased, the need for disk drives of larger capacity, and capable of high speed data transfer rates, has increased. To keep pace, changes and improvements in disk drive performance have been made. For example, data and track density increases, media improvements, and a greater number of heads and disks in a single disk drive have resulted in higher data transfer rates.
A disadvantage of using a single disk drive to provide secondary storage is the expense of replacing the drive when greater capacity or performance is required. Another disadvantage is the lack of redundancy or backup for a single disk drive. When a single disk drive is damaged, inoperable, or replaced, the system is shut down.
One prior art attempt to reduce or eliminate the above disadvantages of single disk drive systems is to use a plurality of drives coupled together in parallel. Data is broken into chunks that may be accessed simultaneously from multiple drives in parallel, or sequentially from a single drive of the plurality of drives. One such system of combining disk drives in parallel is known as "redundant array of inexpensive disks" (RAID). A RAID system provides the same storage capacity as a larger single disk drive system, but at a lower cost. Similarly, high data transfer rates can be achieved due to the parallelism of the array.
RAID systems allow incremental increases in storage capacity through the addition of disk drives to the array. When a disk crashes in the RAID system, it may be replaced without shutting down the entire system. Data on a crashed disk may be recovered using error correction techniques.
RAID Arrays
RAID has six disk array configurations referred to as RAID level 0 through RAID level 5. Each RAID level has advantages and disadvantages. In the present discussion, only RAID levels 4 and 5 are described. However, a detailed description of the different RAID levels is disclosed by Patterson, et al. in A Case for Redundant Arrays of Inexpensive Disks (RAID), ACM SIGMOD Conference, June 1988. This article is incorporated by reference herein.
RAID systems provide techniques for protecting against disk failure. Although RAID encompasses a number of different formats (as indicated above), a common feature is that a disk (or several disks) stores parity information for data stored in the array of disks. A RAID level 4 system stores all the parity information on a single parity disk, whereas a RAID level 5 system stores parity blocks throughout the RAID array according to a known pattern. In the case of a disk failure, the parity information stored in the RAID subsystem allows the lost data from a failed disk to be recalculated.
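The recalculation of lost data from parity can be illustrated with a minimal sketch. It assumes the parity block is the byte-wise exclusive-OR (XOR) of the data blocks in a stripe, which is the conventional choice but is not stated in the text above; the function names `xor_blocks` and `reconstruct` are hypothetical and used only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Recompute the block of a failed disk from the surviving data blocks
    of the same stripe and the stripe's parity block (XOR parity assumed)."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread over three data disks (tiny 2-byte blocks for brevity).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild it from the rest.
recovered = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, the same routine serves both to compute parity and to rebuild any single missing block of the stripe.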
FIG. 1 is a block diagram illustrating a prior art system implementing RAID level 4. The system comprises N+1 disks 112-118 coupled to a computer system, or host computer, by communication channel 130. In the example, data is stored on each hard disk in 4 KByte (KB) blocks or segments. Disk 112 is the Parity disk for the system, while disks 114-118 are Data disks 0 through N-1. RAID level 4 uses disk "striping" that distributes blocks of data across all the disks in an array as shown in FIG. 1. A stripe is a group of data blocks where each block is stored on a separate disk of the N disks along with an associated parity block on a single parity disk. In FIG. 1, first and second stripes 140 and 142 are indicated by dotted lines. The first stripe 140 comprises parity block 0 and data blocks 0 to N-1. In the example shown, a first data block 0 is stored on disk 114 of the N+1 disk array. The second data block 1 is stored on disk 116, and so on. Finally, data block N-1 is stored on disk 118. Parity is computed for stripe 140 using well-known techniques and is stored as parity block 0 on disk 112. Similarly, stripe 142 comprising N data blocks is stored as data block N on disk 114, data block N+1 on disk 116, and data block 2N-1 on disk 118. Parity is computed for stripe 142 and stored as parity block 1 on disk 112.
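The block placement described for FIG. 1 can be summarized as a simple mapping from a data block number to a stripe and a data disk. The sketch below is only an illustration of that layout; the function name `raid4_locate` and the zero-based data disk index are assumptions introduced here, with data disk 0 corresponding to disk 114 and the parity block of every stripe residing on the dedicated parity disk 112.

```python
def raid4_locate(block, n_data_disks):
    """Map a data block number to (stripe, data_disk) in a RAID level 4
    layout with n_data_disks data disks plus one dedicated parity disk.

    Blocks fill a stripe left to right across Data disks 0..N-1 before
    moving to the next stripe; parity for each stripe is on the parity disk.
    """
    stripe, data_disk = divmod(block, n_data_disks)
    return stripe, data_disk


# With N = 4 data disks: block 0 starts stripe 0, block N starts stripe 1,
# and block 2N-1 is the last block of stripe 1.
N = 4
first = raid4_locate(0, N)
start_of_second_stripe = raid4_locate(N, N)
end_of_second_stripe = raid4_locate(2 * N - 1, N)
```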
As shown in FIG. 1, RAID level 4 adds an extra parity disk drive containing error-correcting information for each stripe in the system. If an error occurs in the system, the RAID array must use all of the drives in the array to correct the error in the system. RAID level 4 performs adequately when reading small pieces of data. However, a RAID level 4 array always uses the dedicated parity drive when it writes data into the array.
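The reason every write involves the parity drive can be seen from how parity is kept current. The sketch below assumes XOR parity and a commonly used read-modify-write update, in which the new parity is derived from the old parity, the old data block, and the new data block; neither assumption is stated in the text above, and the function names are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity, old_data, new_data):
    """Parity after overwriting one data block (XOR parity assumed):
    remove the old block's contribution, then add the new block's."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# A stripe of three data blocks and its parity.
stripe = [b"\x0f\x0f", b"\xf0\x01", b"\x33\x44"]
parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Overwrite the block on data disk 1; the parity disk must be written too.
new_block = b"\xaa\x55"
parity_after = updated_parity(parity, stripe[1], new_block)
stripe[1] = new_block
```

Under these assumptions, even a write that touches a single data disk also reads and writes the parity disk, so in RAID level 4 all writes contend for the same drive.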
RAID level 5 array systems also record parity information. However, they do not keep all of the parity sectors on a single drive. RAID level 5 rotates the position of the parity blocks through the available disks in the disk array of N+1 disks. Thus, RAID level 5 systems improve on RAID level 4 performance by spreading parity data across the N+1 disk drives in rotation, one block at a time. For the first set of blocks, the parity block might be stored on the first drive. For the second set of blocks, it would be stored on the second disk drive. This is repeated so that each set has a parity block, but not all of the parity information is stored on a single disk drive. In RAID level 5 systems, because no single disk holds all of the parity information for a group of blocks, it is often possible to write to several different drives in the array at one instant. Thus, both reads and writes are performed more quickly on RAID level 5 systems than on RAID level 4 arrays.
FIG. 2 is a block diagram illustrating a prior art system implementing RAID level 5. The system comprises N+1 disks 212-218 coupled to a computer system or host computer 120 by communication channel 130. In stripe 240, parity block 0 is stored on the first disk 212. Data block 0 is stored on the second disk 214, data block 1 is stored on the third disk 216, and so on. Finally, data block N-1 is stored on disk 218. In stripe 242, data block N is stored on the first disk 212. The second parity block 1 is stored on the second disk 214. Data block N+1 is stored on disk 216, and so on. Finally, data block 2N-1 is stored on disk 218. In the M-1 stripe 244, data block MN-N is stored on the first disk 212. Data block MN-N+1 is stored on the second disk 214. Data block MN-N+2 is stored on the third disk 216, and so on. Finally, parity block M-1 is stored on the nth disk 218. Thus, FIG. 2 illustrates that RAID level 5 systems store the same parity information as RAID level 4 systems; however, RAID level 5 systems rotate the positions of the parity blocks through the available disks 212-218.
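The rotation described for FIG. 2 can be expressed as a mapping from a data block number to a disk position. The sketch below assumes that the parity position advances by one disk per stripe and wraps around after the last disk, which matches the first two stripes described above but is otherwise an assumption; the function name `raid5_locate` and the zero-based disk index (disk 0 corresponding to disk 212) are hypothetical.

```python
def raid5_locate(block, n_data_per_stripe):
    """Map a data block number to (stripe, disk, parity_disk) for a RAID
    level 5 array of n_data_per_stripe + 1 disks.

    Assumption: parity for stripe s is on disk s mod (N+1); the N data
    blocks of the stripe fill the remaining disks from left to right.
    """
    n_disks = n_data_per_stripe + 1
    stripe, offset = divmod(block, n_data_per_stripe)
    parity_disk = stripe % n_disks
    # Skip over the parity position when placing data within the stripe.
    disk = offset if offset < parity_disk else offset + 1
    return stripe, disk, parity_disk


# With N = 3 data blocks per stripe (four disks in total):
N = 3
block_0 = raid5_locate(0, N)      # stripe 0: parity on the first disk
block_n = raid5_locate(N, N)      # stripe 1: first data block on the first disk
block_n1 = raid5_locate(N + 1, N) # stripe 1: data skips the parity disk
```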
In RAID level 5, parity is distributed across the array of disks. This leads to multiple seeks across the disk. It also inhibits simple increases to the size of the RAID array since a fixed number of disks must be added to the system due to parity requirements.
The prior art systems for implementing RAID levels 4 and 5 have several disadvantages. The first disadvantage is that, after a system failure, the parity information for each stripe is inconsistent with the data blocks stored on the other disks in the stripe. This requires the parity for the entire RAID array to be recalculated. The parity is recomputed entirely because there is no method for knowing which parity blocks are incorrect. Thus, all the parity blocks in the RAID array must be recalculated. Recalculating parity for the entire RAID array is highly time consuming since all of the data stored in the RAID array must be read. For example, reading an entire 2 GB disk at maximum speed takes 15 to 20 minutes to complete. However, since few computer systems are able to read very many disks in parallel at maximum speed, recalculating parity for a RAID array takes even longer.
One technique for hiding the time required to recompute parity for the RAID array is to allow access to the RAID array immediately, and recalculate parity for the system while it is on-line. However, this technique suffers from two problems. The first problem is that, while recomputing parity, blocks having inconsistent parity are not protected from further corruption. During this time, a disk failure in the RAID array results in permanently lost data in the system. The second problem with this prior art technique is that RAID subsystems perform poorly while calculating parity. This occurs due to the time delays created by a plurality of input/output (I/O) operations imposed to recompute parity.
The second disadvantage of the prior art systems involves writes to the RAID array during a period when a disk is not functioning. Because a RAID subsystem can recalculate data on a malfunctioning disk using parity information, the RAID subsystem allows data to continue being read even though the disk is malfunctioning. Further, many RAID systems allow writes to continue although a disk is malfunctioning. This is disadvantageous since writing to a broken RAID array can corrupt data in the case of a system failure. For example, a system failure occurs when an operating system using the RAID array crashes or when power for the system fails or is otherwise interrupted. Prior art RAID subsystems do not provide protection for this sequence of events.
The present invention is a method for providing error correction for an array of disks using non-volatile random access memory (NV-RAM).
Non-volatile RAM is used to increase the speed of RAID recovery from disk error(s). This is accomplished by keeping a list of all disk blocks for which the parity is possibly inconsistent. Such a list of disk blocks is smaller than the total number of parity blocks in the RAID subsystem. The total number of parity blocks in the RAID subsystem is typically in the range of hundreds of thousands of parity blocks. Knowledge of the number of parity blocks that are possibly inconsistent makes it possible to fix only those few blocks, identified in the list, in a significantly smaller amount of time than is possible in the prior art. The present invention also provides a technique of protecting against simultaneous system failure and a broken disk and of safely writing to a RAID subsystem with one broken disk.
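The idea of keeping a list of possibly inconsistent blocks can be illustrated with a small simulation. This is a sketch under stated assumptions, not the claimed method: an in-memory Python set stands in for the list held in NV-RAM, entries are tracked per stripe, parity is assumed to be XOR, and the class `Array` with its methods and the `crash_before_parity` flag are hypothetical names introduced only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (XOR parity assumed)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class Array:
    """Toy disk array: data[stripe][disk] plus one parity block per stripe."""

    def __init__(self, stripes):
        self.data = [list(s) for s in stripes]
        self.parity = [xor_blocks(s) for s in stripes]
        self.nvram_dirty = set()  # stand-in for the list kept in NV-RAM

    def write(self, stripe, disk, block, crash_before_parity=False):
        self.nvram_dirty.add(stripe)        # record intent before touching disks
        self.data[stripe][disk] = block
        if crash_before_parity:
            return                          # simulated system failure
        self.parity[stripe] = xor_blocks(self.data[stripe])
        self.nvram_dirty.discard(stripe)    # data and parity agree again

    def recover(self):
        """Recompute parity only for stripes listed as possibly inconsistent."""
        repaired = sorted(self.nvram_dirty)
        for stripe in repaired:
            self.parity[stripe] = xor_blocks(self.data[stripe])
        self.nvram_dirty.clear()
        return repaired


array = Array([[b"\x01", b"\x02"], [b"\x03", b"\x04"], [b"\x05", b"\x06"]])
array.write(0, 0, b"\x0a")                            # completes normally
array.write(1, 1, b"\x0b", crash_before_parity=True)  # interrupted write
repaired = array.recover()
```

In this simulation only the stripe touched by the interrupted write is revisited after the failure; the remaining stripes are never read, which is the source of the time saving described above.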