1. FIELD OF THE INVENTION
The present invention relates to the field of data storage systems, and in particular to a method and apparatus for providing real time reconstruction of corrupted data in a redundant array data storage system.
2. BACKGROUND ART
A typical data storage system involves one or more data storage units that provide data storage and retrieval for a computer or other data processing device. Data storage units may include high capacity tape drives, solid state memory chips, and magnetic, optical, and/or magneto-optical disk drives.
Data storage applications, such as on-line banking systems or video file systems, require nearly 100% reliability of a data storage system. That is, any data sent to the data storage system must be accurately stored, and the data must be accurately delivered from the storage system when requested. Present data storage units are not 100% reliable; instead, they generally have a statistically predictable rate of failure. Such failures range from a localized failure, such as the corruption of a single bit of data, to a complete failure of the data storage unit.
The consequences of storage system failures range from irretrievable loss of data to delays in the delivery of data while corrupted data is recovered or reconstructed. Irretrievable loss of data is a severe problem for any system. Delays in the delivery of data as it is being recovered or reconstructed may also have severe consequences, particularly where the data that is being delivered by the data storage system consists of long, continuous streams of data whose integrity depends on delivery of the data at a constant rate.
An example of a system in which interruptions in the delivery of data produce unacceptable results is a video file system used for on-demand delivery of full length video programs. In such a system, hundreds of full length videos are stored in digital form in a multi-terabyte data storage system. Video data is retrieved in real time at a customer's request, and delivered via a communications network to the customer for viewing. In this type of system, interruptions in the continuous flow of video data from the data storage system to the customer cause "blips" or other unacceptable deterioration in the quality of the video image being delivered. To deliver a satisfactory product to the end-user, such interruptions must be avoided.
It has been proposed that data storage reliability can be improved by generating error correction information and/or using redundant data storage units. For example, U.S. Pat. No. 5,208,813 to Stallmo for "On-Line Reconstruction of a Failed Redundant Array System" describes a number of different approaches to providing reliable data storage using arrays of redundant, inexpensive disks. Five different architectures are described under the acronym "RAID" ("Redundant Arrays of Inexpensive Disks").
RAID 1 provides a "mirrored" storage unit for each primary storage unit that keeps a duplicate copy of all data on the primary unit. While RAID 1 systems provide increased reliability, they double the cost of data storage.
In RAID 2 systems, each bit of each word of data, plus the "Error Detection and Correction" ("EDC") bits for each word, is stored on a separate storage unit (this technique is known as "bit striping"). A RAID 2 system provides good reliability and a high data transfer bandwidth, since essentially an entire block of data is transferred during the time each disk drive needs to transfer a single bit. However, disadvantages of a RAID 2 system include the large number of disks needed and the high ratio of error-detection-and-correction bit storage disks to data storage disks. In addition, because the disks are accessed essentially in unison to read or write a block of data, there is effectively only a single actuator for all disks. As a result, the performance of the system for random reads of small files is degraded.
RAID 3 systems are based on the idea that a typical storage unit has internal means for determining data or system errors. Accordingly, the location of an error can be determined by the storage unit itself, and parity checking, a simpler form of error correction, can be used. RAID 3 systems, like RAID 2 systems, use a separate storage unit to store each bit in a word of data. The contents of these storage units are "Exclusive OR'd" ("XOR'd") to generate parity information, which is stored on a single extra storage unit. If any storage unit fails, the data on the failed storage unit can be reconstructed by XOR'ing the data on the remaining storage units with the parity information. RAID 3 systems require a smaller ratio of redundancy storage units to data storage units than RAID 2 systems. However, because data is stored bitwise, RAID 3 systems suffer the same performance degradation as RAID 2 systems for random reads of small files.
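The parity reconstruction described above can be illustrated with a minimal sketch. The three two-byte "storage units" and the helper name `xor_blocks` are hypothetical, chosen only for illustration; the sketch models each unit as a short byte string rather than reproducing the bitwise layout of any actual RAID 3 system.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of three data storage units (assumed values).
data_units = [b"\x0f\xa0", b"\x33\x55", b"\xf0\x0c"]

# Parity stored on the single extra storage unit.
parity = xor_blocks(data_units)

# Suppose unit 1 fails: XOR the surviving data with the parity to rebuild it.
survivors = [data_units[0], data_units[2], parity]
recovered = xor_blocks(survivors)
```

Because XOR is its own inverse, XOR'ing the survivors with the parity cancels every contribution except that of the missing unit.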
RAID 4 systems improve on the performance of RAID 3 systems by parceling data among the data storage units in amounts larger than the single bits used in RAID 3 systems. In RAID 4 systems, the size of such a "block" of data is typically a disk sector. Parceling data out in such blocks is also referred to as "block striping." For each "stripe" of data, a parity block is stored on a single extra storage unit designated as the parity unit.
A limitation of RAID 4 systems is that every time data is written to any of the independently operating data storage units, new parity information must also be written to the parity unit. The old data and the parity information stored on the parity unit must first be read. The old parity information is XOR'd with the old data (to remove the information content of the old data from the parity data), and the result is XOR'd with the new data (to calculate the new parity information). The new data and the new parity information must then be written to the respective data and parity storage units. This process is referred to as a "Read-Modify-Write" process.
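The parity arithmetic of the Read-Modify-Write process can be sketched as follows. The function names and block contents are hypothetical, and the sketch models only the XOR steps, not the disk reads and writes themselves.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_data, old_parity, new_data):
    """Return the new parity block after one data block is overwritten."""
    # Remove the old data's contribution from the parity.
    parity_without_old = xor_bytes(old_parity, old_data)
    # Fold in the new data to obtain the new parity.
    return xor_bytes(parity_without_old, new_data)

# Hypothetical stripe of three data blocks (assumed values).
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Overwrite block 1 and update parity using only the old block and old parity.
new_block = b"\xff\x00"
new_parity = read_modify_write(stripe[1], parity, new_block)
stripe[1] = new_block

# For comparison: parity recomputed from the whole updated stripe.
full_recompute = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])
```

The update touches only the changed data block and the parity block, which is why every write involves the parity unit but not the other data units.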
Accordingly, a read and a write occur at the parity unit every time a data record is changed on any of the data storage units. Thus the parity unit becomes a potential bottleneck in a RAID 4 system.
RAID 5 systems use the same block size and parity error correction concepts as RAID 4 systems. However, instead of having a single storage unit dedicated to storing parity information, in a RAID 5 system the parity information is distributed among all the storage units in the system.
RAID 5 systems use the concept of a "redundancy group." A "redundancy group" is a set of "N+1" storage units. Each of the storage units is divided into a number of equally sized address areas called "blocks." Each storage unit usually contains the same number of such blocks. A set of blocks, one from each storage unit in a redundancy group, that share the same range of addresses is called a "stripe." Each "stripe" of blocks in the redundancy group contains "N" blocks of data and one ("+1") block of parity data. The location of the block of parity data changes from one stripe to the next. For example, for a RAID 5 system with a redundancy group consisting of five disk drives, the parity data for the first stripe might be stored on the first disk drive, the parity data for the second stripe on the second disk drive, and so on. The parity block thus traverses the disk drives in a helical pattern.
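The five-drive example above can be sketched as a simple placement rule. The modulo rotation and the function names below are assumptions made for illustration; actual RAID 5 implementations may rotate the parity block in other orders.

```python
def parity_unit(stripe, num_units):
    """Index of the storage unit holding parity for a stripe (assumed rotation)."""
    return stripe % num_units

def stripe_layout(stripe, num_units):
    """Mark each unit in the stripe as parity ("P") or data ("D")."""
    p = parity_unit(stripe, num_units)
    return ["P" if unit == p else "D" for unit in range(num_units)]

# Layout of the first five stripes in a five-drive redundancy group (N = 4).
layout = [stripe_layout(s, 5) for s in range(5)]
for row in layout:
    print(" ".join(row))
```

Each row contains exactly one parity block and N data blocks, and the parity position advances by one drive per stripe, wrapping around after the last drive.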
Since no single storage unit is used to store all of the parity data in a RAID 5 system, the single storage unit bottleneck of RAID 4 systems is alleviated. However, each time data is written to any of the data blocks in a stripe, the parity block must still be read-modified-and-written as in RAID 4.
RAID 5 systems provide the capability for reconstructing one block of corrupted data for every stripe. The corruption of a block of data might result from a local failure confined to a specific sector of a storage unit (for example, from a dust particle interfering with a read-write head or from a single disk sector going bad) or from the failure of a storage unit as a whole (resulting, for example, from a head crash or controller failure).
When a read request to a prior art RAID 5 system results in a localized block input-output (IO) error, the system typically retries the unsuccessful read several times before determining that the block is irretrievably bad. Once such a determination is made, the RAID 5 system issues read requests to the other storage units in the redundancy group for the other blocks in the affected stripe. The missing data is then reconstructed by XOR'ing the good data and parity blocks, and the reconstructed data is delivered by the RAID 5 system to the device that issued the read request. Thus a block IO failure in prior art RAID 5 systems results in a significant delay before the reconstructed data can be delivered, as compared to the time required to deliver requested data when there is no IO failure. In addition, the reconstruction process requires that multiple IO requests be issued for each IO request that fails. Accordingly, the reconstruction process ties up system resources and reduces the data throughput of the storage system.
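The cost of this prior art retry-then-reconstruct sequence can be made concrete with a small simulation. Everything here is hypothetical: the retry limit of 3 stands in for "several times," the stripe is a list of one-byte blocks with parity last, and `io_log` simply counts the IO requests issued.

```python
from functools import reduce

MAX_RETRIES = 3  # assumed value; the text says only "several times"

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def read_block(stripe, index, bad_indexes, io_log):
    """Simulated storage unit read: log the request, fail if the block is bad."""
    io_log.append(index)
    if index in bad_indexes:
        raise IOError(f"block {index} unreadable")
    return stripe[index]

def read_with_reconstruction(stripe, index, bad_indexes, io_log):
    """Retry the read, then fall back to rebuilding from the rest of the stripe."""
    for _ in range(MAX_RETRIES):
        try:
            return read_block(stripe, index, bad_indexes, io_log)
        except IOError:
            continue
    # Block declared bad: read every other block in the stripe, parity included.
    others = [read_block(stripe, i, bad_indexes, io_log)
              for i in range(len(stripe)) if i != index]
    return xor_blocks(others)

data = [b"\x01", b"\x02", b"\x04", b"\x08"]
stripe = data + [xor_blocks(data)]  # last block holds the parity

good_log, bad_log = [], []
good = read_with_reconstruction(stripe, 0, {2}, good_log)  # healthy block
bad = read_with_reconstruction(stripe, 2, {2}, bad_log)    # corrupted block
```

In this sketch a healthy read costs one IO request, while the failed read costs three retries plus four additional reads before the data can be delivered, which illustrates the delay and extra load described above.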
If an entire storage unit fails in a RAID 5 system, a replacement unit can be substituted and the lost data reconstructed, stripe by stripe. In the RAID system disclosed in U.S. Pat. No. 5,208,813, data from a failed storage unit may be read while reconstruction of the data onto the replacement storage unit is taking place. When a read request is received for a block of data on a storage unit other than the failed storage unit, the block of data is read from the appropriate storage unit in the normal manner. When a read request is received for a block on the failed storage unit, however, the system issues read requests for all the other data blocks and the parity block for that stripe from the other, functioning storage units. The system then reconstructs the corrupted data, and delivers it to the requesting data processing system.
Prior art RAID systems thus reconstruct data only when the system determines, after multiple read attempts or by other means, that a sector of a disk is bad, or, alternatively, when the system determines, after a predetermined number of unsuccessful IO operations to the same storage unit or by other means, that an entire storage unit has failed. In both cases, in order to deliver the requested data, the system performs additional steps that are not performed during normal operation. Because of the overhead associated with making these determinations and performing these additional steps, delivery of reconstructed corrupted data imposes time delays as compared to delivery of uncorrupted data. In addition, the reconstruction process ties up system resources, decreasing the data throughput of the data storage system. Thus, prior art systems are not able to reliably deliver continuous, high bandwidth, uninterrupted, and undelayed streams of information.
U.S. Pat. No. 5,278,838 issued to Ng et al. discloses a method for rebuilding lost data in a RAID system while reducing interference with normal data recovery and storage operations.
U.S. Pat. No. 5,315,602 issued to Noya et al. discloses a system for reducing the number of I/O requests required to write data in a RAID system.
U.S. Pat. No. 5,305,326 issued to Solomon et al. discloses a method for handling reconstruction of data after a power failure, for example after a power failure of an I/O processor in a RAID system.
U.S. Pat. No. 5,303,244 issued to Watson discloses a method for mapping logical RAID storage arrays to physical disk drives.
U.S. Pat. No. 5,235,601 issued to Stallmo et al. discloses a method for restoring valid data in a RAID system after a write failure caused by a storage unit fault.
U.S. Pat. No. 5,233,618 issued to Glider et al. discloses a method and apparatus for detecting and reconstructing incorrectly routed data, for detecting when a failure in writing a block of data has occurred, and for reconstructing the lost data.
U.S. Pat. No. 5,287,462 issued to Jibbe et al. discloses an apparatus for coupling a host bus with a number of storage array busses in a RAID system.
U.S. Pat. No. 5,124,987 issued to Milligan et al. discloses a disk drive array in which updates of redundancy data are eliminated by writing modified "virtual track instances" into logical tracks of the disks comprising a redundancy group.
U.S. Pat. No. 5,088,081 issued to Farr discloses a RAID system in which reconstructed data from a bad data block are stored on a "reserve disk."
U.S. Pat. No. 4,761,785 issued to Clark et al. discloses a storage management system in which parity blocks are distributed among a set of storage devices instead of being stored in a single storage device.