Redundant Arrays of Inexpensive Disks (RAID) technology is known in the art. RAID storage systems are commonly used in high-profile industries, such as the banking and airline industries, where the inability to access certain data for even a moment, let alone its loss, can spell disaster. RAID storage systems are often referred to as "fault-tolerant" due to their ability to access data even when one or more storage devices fail. RAID storage systems accomplish this by distributing redundant data across multiple storage devices. RAID technology is independent of the type of storage device used, and thus may be applied to systems which use magnetic, optical, or semiconductor disk drives, large capacity tape drives, or a mix of different types of storage devices. Several RAID architectures exist for providing redundant access to data. The particular RAID architecture used dictates both the format of the data across the multiple storage devices and the way in which the redundant data is accessed. RAID architectures are categorized in levels ranging from 1 to 5 according to the architecture of the storage format.
In a level 1 RAID storage system, a duplicate set of data is stored on pairs of "mirrored" storage devices. Accordingly, identical copies of data are stored to each storage device in each pair of mirrored storage devices. The level 1 RAID storage system provides absolute redundancy and therefore high reliability, but it requires twice the storage space. This method is therefore costly and space-consuming.
In a level 2 RAID storage system, each bit of each word of data, plus Error Detection and Correction (EDC) bits for each word, is stored on a separate storage device. Thus, in a 32-bit word architecture having 7 EDC bits, 39 separate storage devices are required to provide the redundancy. In this example, if one of the storage devices fails, the remaining 38 bits of each stored 39-bit word can be used to reconstruct each 32-bit word on a word-by-word basis as each data word is read from the storage devices, thereby obtaining fault tolerance. Because redundancy is achieved not by duplicating the data but by reconstructing inaccessible data from the accessible data, less actual storage space is required to achieve redundancy. The level 2 RAID storage system nevertheless has the disadvantage that it requires one storage device for each bit of data and EDC, which can amount to a very large and costly system.
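By way of illustration only, the storage cost of the level 1 and level 2 arrangements described above may be compared with a short calculation. The function name `redundancy_overhead` is hypothetical and is introduced solely for this sketch; the figures are the ones given in the text (a mirrored pair, and a 32-bit word with 7 EDC bits).

```python
from fractions import Fraction


def redundancy_overhead(data_devices: int, redundant_devices: int) -> Fraction:
    """Return the number of redundant devices required per data device."""
    return Fraction(redundant_devices, data_devices)


# Level 1: every data device is paired with one mirror device.
raid1 = redundancy_overhead(data_devices=1, redundant_devices=1)

# Level 2, using the example in the text: 32 data bits plus 7 EDC bits,
# one storage device per bit.
raid2 = redundancy_overhead(data_devices=32, redundant_devices=7)
raid2_total_devices = 32 + 7

print(f"level 1 overhead: {float(raid1):.0%}")
print(f"level 2 overhead: {float(raid2):.1%} across {raid2_total_devices} devices")
```

The sketch shows the trade-off stated above: level 2 needs proportionally less redundant storage than mirroring, but only by spreading each word over a large number of devices.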
In a level 3 RAID storage system, each storage device itself includes error detection means. This is often achieved using a custom-designed Application Specific Integrated Circuit (ASIC) within the storage device itself that is designed to provide built-in hardware error detection and correction capabilities. Level 3 RAID systems accordingly do not need the more sophisticated multiple EDC bits, which allows a simpler exclusive-or parity checking scheme requiring only one bit to be used to generate parity information. Level 3 RAID storage systems thus only require one storage device to store parity information, which, in combination with each of the data bit storage devices, may be used to recover the accessible bits and reconstruct inaccessible data.
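The exclusive-or parity scheme described above may be sketched as follows. This is a minimal illustration, not an implementation of any particular controller: the helper names `parity` and `reconstruct` and the three sample values are assumptions made for the example. It also assumes, as the text states, that the failed device is identified by the device's own error detection means, since parity alone cannot locate the failure.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """Exclusive-or of all data values; this is what the parity device stores."""
    return reduce(xor, blocks, 0)


def reconstruct(surviving_blocks, parity_block):
    """Recover the single missing value from the survivors and the parity."""
    return reduce(xor, surviving_blocks, parity_block)


# Illustrative values only.
data = [0b001, 0b010, 0b100]
p = parity(data)

# Assume the device holding data[1] has been reported as failed.
lost_index = 1
surviving = [b for i, b in enumerate(data) if i != lost_index]
recovered = reconstruct(surviving, p)

print(format(p, "03b"), format(recovered, "03b"))
```

Because XOR is its own inverse, combining the parity with every surviving value cancels their contributions and leaves only the missing value.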
In the level 2 and 3 RAID storage systems, each bit of the data and parity is transferred to and from each respective distributed storage device in unison. In other words, this arrangement effectively provides only a single read/write head actuator for the entire array of storage devices. For large files, this arrangement has a high data transfer bandwidth since each individual storage device actuator transfers part of a block of data, which allows an entire block to be accessed much faster than if a single storage device actuator were accessing the block. However, when the data files to be accessed are small, the random access performance of the drive array is adversely affected since only one data file at a time can be accessed by the "single" actuator.
A level 4 RAID storage system employs the same parity error correction scheme as the level 3 RAID architecture, but essentially decouples the individual storage device actuators to improve on the performance of small file access by reading and writing a larger minimum amount of data, such as a disk sector rather than a single bit, to each disk. This is also known as block striping. In the level 4 RAID architecture, however, writing a data block on any of the independently operating storage devices also requires writing a new parity block on the parity unit. The parity information stored on the parity unit must be read and XOR'd with the old data (to "remove" the information content of the old data), and the resulting sum must then be XOR'd with the new data (to "add" the new parity information). Both the data and the parity records must then be rewritten to the disk drives. This process is commonly referred to as a "Read-Modify-Write" (RMW) operation. Thus, a READ and a WRITE on the single parity storage device occur each time a record is changed on any of the storage devices covered by a parity record on the parity storage device. The parity storage device becomes a bottleneck to data writing operations since the number of changes to records which can be made per unit of time is a function of the access rate of the parity storage device, as opposed to the faster access rate provided by parallel operation of the multiple storage devices.
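The Read-Modify-Write parity update described above may be sketched as follows. The function name `rmw_parity` and the sample values are hypothetical and chosen only for illustration; an actual controller operates on whole blocks rather than three-bit values.

```python
def rmw_parity(old_parity: int, old_data: int, new_data: int) -> int:
    """Compute the new parity for a single-block write.

    Step 1: XOR the old data out of the old parity ("remove").
    Step 2: XOR the new data into the result ("add").
    """
    removed = old_parity ^ old_data
    return removed ^ new_data


# Illustrative stripe: three data blocks and their parity.
stripe = [0b001, 0b010, 0b100]
stripe_parity = stripe[0] ^ stripe[1] ^ stripe[2]

# Write a new value to the first block using RMW.
new_value = 0b100
updated_parity = rmw_parity(stripe_parity, stripe[0], new_value)
stripe[0] = new_value

# The RMW result agrees with recomputing parity over the whole stripe.
recomputed = stripe[0] ^ stripe[1] ^ stripe[2]
print(format(updated_parity, "03b"), format(recomputed, "03b"))
```

The sketch touches only the block being written and the parity block, which is why every write, regardless of which data device it targets, generates a read and a write on the single parity device.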
A level 5 RAID storage system is similar to the level 4 RAID architecture in its parity error correction scheme and in its decoupling of the individual storage device actuators, but improves upon the performance of WRITE accesses by distributing the data and parity information over all of the available storage devices in a circular fashion. Accordingly, the number of WRITE operations which can be made per unit of time is no longer a function of the access rate of a single parity storage device because the parity information is distributed across all the storage devices. Typically, "N+1" storage devices in a set, or "redundancy group", are divided into a plurality of equally sized address areas referred to as blocks. Each storage device generally contains the same number of blocks. Blocks from each storage device in a redundancy group having the same unit address ranges are referred to as "stripes". Each stripe has N blocks of data, plus one parity block on one storage device containing parity for the N data blocks of the stripe. Further stripes each have a parity block, the parity blocks being distributed on different storage devices. Parity updating activity associated with every modification of data in a redundancy group is therefore distributed over the different storage devices. No single storage device is burdened with all of the parity update activity, and thus the parity storage device access bottleneck is diffused. For example, in a level 5 RAID system comprising five storage devices, the parity information for the first stripe of blocks may be written to the fifth drive; the parity information for the second stripe may be written to the fourth drive; the parity information for the third stripe may be written to the third drive, and so on. The parity block for succeeding stripes typically circles around the storage devices in a helical pattern.
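The rotating parity placement in the five-device example above may be sketched as follows. The function name `parity_device` and the zero-based numbering are assumptions of this sketch, and the formula is only one rotation that reproduces the example; other level 5 layouts rotate parity differently.

```python
def parity_device(stripe: int, num_devices: int) -> int:
    """Zero-based index of the device holding parity for a zero-based stripe.

    Parity starts on the last device and moves back one device per stripe,
    wrapping around after the first device.
    """
    return (num_devices - 1 - stripe) % num_devices


NUM_DEVICES = 5
for stripe in range(6):
    p = parity_device(stripe, NUM_DEVICES)
    # Mark the parity block "P" and the N data blocks "D" for this stripe.
    row = ["P" if dev == p else "D" for dev in range(NUM_DEVICES)]
    print(f"stripe {stripe + 1}: {' '.join(row)}  (parity on drive {p + 1})")
```

Printing the layout shows the parity block stepping from the fifth drive to the fourth, third, second, and first, and then wrapping back to the fifth, so that parity updates are spread evenly over all devices.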
The RAID storage systems described above all handle the problem of providing access to redundant data if one or more storage devices fail. However, prior art RAID storage systems provided only one storage device array controller. In such a system, if the controller fails, data is inaccessible regardless of the RAID architecture level, so storage of redundant data is rendered moot.
One solution to this problem is to provide redundant storage device controllers. In RAID storage systems which have redundant controllers, generally only one controller (i.e., the "primary controller") is active for accessing a particular logical volume at a time. Any additional controllers (i.e., "secondary controllers") operate in a "standby" mode for that particular logical volume. If the primary controller for the particular logical volume fails, one of the secondary controllers takes over to perform accesses for the particular logical volume. Generally, in level 4 and 5 RAID storage systems, only the primary controller is active for a particular logical volume due to the necessity for serializing WRITE operations to the same parity group. Level 4 and 5 RAID storage systems employ the read-modify-write (RMW) method to maintain accurate parity for reconstructing the data. If WRITE operations to the same parity group in these level 4 and 5 RAID storage systems are not serialized, controller collisions, which occur when more than one controller attempts to write data within the same parity group, may cause invalid parity to be generated. Invalid parity results in data which cannot be reconstructed.
FIG. 1 illustrates a controller collision which results in invalid parity. As shown in FIG. 1, four storage devices 1-4 in a RAID storage system respectively store three data blocks 11-13 and a parity block 14, together comprising a single redundancy group. Data block 11 has the value "001"; data block 12 has the value "010"; data block 13 has the value "100"; and parity block 14 has the value "111", which is the exclusive-OR of each of the data values in data blocks 11-13. In the collision example of FIG. 1, the RAID storage system utilizes a primary array controller 20 and a secondary array controller 30. To illustrate the collision, it will be assumed that the primary controller is writing a new value "100" to data block 11 during the time that the secondary array controller is writing a new value "110" to data block 12. A write to a data block requires not only the data to be written to the appropriate data block, but that a read-modify-write (RMW) operation be performed on the associated parity block (i.e., the parity block must be read, updated by removing the old data content and adding the new data content, and then written back to the parity storage device). Accordingly, primary array controller 20 reads the old data value "001" from the data block 11 into an old data register 21 and the old parity value "111" from the parity block 14 into an old parity register 22. The data content is then "removed" from the old parity by performing an exclusive-OR on the old data value stored in the old data register 21 and the old parity value stored in the old parity register 22. The resulting value "110" of the exclusive-OR calculation is stored in a "removed" parity register 23. The new data value "100" is stored in a new data register 24. A new parity value is then calculated by performing an exclusive-OR on the "removed" parity value and the new data value, and the resulting value "010" is then stored in a new parity register 25.
If, during the time that the primary array controller 20 is calculating a new parity value for data block 11, the secondary array controller 30 performs a WRITE operation to a data block in the same redundancy group, it is possible that its RMW operation for updating the parity will collide with the RMW operation by the primary array controller 20. The problem is illustrated in FIG. 1, where the secondary array controller is writing a new value "110" to data block 12. Accordingly, secondary array controller 30 reads the old data value "010" from the data block 12 into an old data register 31 and the old parity value "111" from the parity block 14 into an old parity register 32. The data content is then "removed" from the old parity by performing an exclusive-OR on the old data value stored in the old data register 31 and the old parity value stored in the old parity register 32. The resulting value "101" of the exclusive-OR calculation is stored in a "removed" parity register 33. The new data value "110" is stored in a new data register 34. A new parity value is then calculated by performing an exclusive-OR on the "removed" parity value and the new data value, and the resulting value "011" is then stored in a new parity register 35.
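The collision may be reproduced with the values given for FIG. 1. The following is a minimal sketch only: the function names are hypothetical, the two controllers are modeled as plain sequential steps rather than concurrent hardware, and the assumption that the secondary controller writes its parity last is arbitrary (either order leaves invalid parity).

```python
def rmw_new_parity(old_data: int, old_parity: int, new_data: int) -> int:
    """Remove the old data from the old parity, then add the new data."""
    removed = old_data ^ old_parity
    return removed ^ new_data


def expected_parity(blocks) -> int:
    """Parity that is actually valid for the given data blocks."""
    p = 0
    for b in blocks:
        p ^= b
    return p


# Collision: both controllers read the same old parity "111" before either writes.
blocks = [0b001, 0b010, 0b100]
parity = 0b111
primary_parity = rmw_new_parity(blocks[0], parity, 0b100)    # yields "010"
secondary_parity = rmw_new_parity(blocks[1], parity, 0b110)  # yields "011"
blocks[0], blocks[1] = 0b100, 0b110
collided_parity = secondary_parity  # assumption: the secondary writes last
correct = expected_parity(blocks)

# Serialized: the second RMW reads the parity written by the first.
s_blocks = [0b001, 0b010, 0b100]
s_parity = 0b111
s_parity = rmw_new_parity(s_blocks[0], s_parity, 0b100)
s_blocks[0] = 0b100
s_parity = rmw_new_parity(s_blocks[1], s_parity, 0b110)
s_blocks[1] = 0b110

print("collided:  ", format(collided_parity, "03b"))
print("serialized:", format(s_parity, "03b"))
print("valid:     ", format(correct, "03b"))
```

Neither "010" nor "011" is the exclusive-OR of the three data blocks as finally written, whereas the serialized sequence arrives at the valid parity.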
As will be appreciated from the above description of the RMW operations by the primary and secondary controllers, one or the other controller must complete its RMW operation before the other controller can begin its RMW operation. Otherwise, the array controller that begins its RMW operation before the other controller has finished the write portion of its own RMW operation will read a parity value that is no longer valid, resulting in the propagation of invalid parity values from that point on. Accordingly, prior art RAID storage systems which have redundant array controllers either allow only one controller to operate at a time, such as by utilizing only the primary array controller unless it fails and only then utilizing the secondary controller, or statically bind each storage device to one array controller or another and allow WRITEs to any given storage device to be controlled only by the array controller which owns it.
The redundant controller schemes of the prior art are problematic. Since one or more array controller resources remain idle unless the primary controller fails, the RAID storage system can never operate at full capacity. Indeed, these idle resources essentially limit the I/O bandwidth of the RAID storage system to half or less. One solution to this problem includes modifying the array controllers to be aware of each other's RMW operations and to schedule accordingly. However, this solution adds complexity and expense to the current controllers in the industry. It would be desirable to provide a system and method for achieving dynamic load balancing of READ and WRITE access requests across multiple redundant array controllers to increase the I/O bandwidth without modifying the current controllers.