1. Technical Field:
The present invention relates to sets of mass storage devices that collectively perform as one or more logical mass storage devices. In particular, the invention relates to a system and method of command queuing on parity drives in RAID levels 4 and 5 systems.
2. Description of the Related Art:
Use of disk memory continues to be important in computers because it is nonvolatile and because memory size demands continue to outpace practical amounts of main memory. At this time, disks are slower than main memory so that system performance is often limited by disk access speed. Therefore, it is important for overall system performance to increase memory size and data access speed of disk drive units. For a discussion of this, see Michelle Y. Kim, "Synchronized Disk Interleaving", IEEE Transactions On Computers, Vol. C-35, No. 11, November 1986.
Disk memory size can be increased by increasing the number of disks and/or increasing the diameters of the disks, but this does not increase data access speed. Memory size and data transfer rate can both be increased by increasing the density of data storage. The data transfer rate can also be increased by increasing disk rotational speed. However, technological constraints limit data density, and high-density, high-speed disks are more prone to errors.
A variety of techniques have been utilized to improve data access speed. Disk cache memory capable of holding an entire track of data has been used to eliminate seek and rotation delays for successive accesses to data on a single track. Multiple read/write heads have been used to interleave blocks of data on a set of disks or on a set of tracks on a single disk. Common data block sizes are byte size, word size, and sector size. Disk interleaving is a known supercomputer technique for increasing performance, and is discussed, for example, in the above-noted article.
Data access performance can be measured by a number of parameters, depending on the relevant application. In transaction processing (such as in banking) data transfers are typically small and request rates are high and random. In supercomputer applications, on the other hand, transfers of large data blocks are common.
Recently developed, interrelated disk memory architectures with improved performance at relatively low cost are grouped under the term "Redundant Arrays of Inexpensive Disks" (RAID). See, for example, David A. Patterson, et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Report No. UCB/CSD 87/89, December, 1987, Computer Science Division (EECS), University of California, Berkeley, Calif. 94720. As discussed in the Patterson et al. reference, the large personal computer market has supported the development of inexpensive disk drives having a better ratio of performance to cost than Single Large Expensive Disk (SLED) systems. The number of I/Os per second per read/write head in an inexpensive disk is within a factor of two of the large disks. Therefore, the parallel transfer from several inexpensive disks in a RAID architecture, in which a set of inexpensive disks function as a single logical disk drive, produces greater performance than a SLED at a reduced price.
Unfortunately, when data is stored on more than one disk, the mean time to failure (MTTF) varies inversely with the number of disks in the array. To correct for this decreased mean time to failure of the system, error recognition and correction is characteristic of all RAID architectures. The Patterson et al. reference discusses 5 RAID architectures each having a different means for error recognition and correction. These RAID architectures are referred to as RAID levels 1-5.
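The inverse relationship between array size and mean time to failure can be illustrated with a short calculation. The sketch below assumes independent disks with identical, constant failure rates and no redundancy; the function name and the 30,000 hour figure are hypothetical values chosen only for illustration.

```python
def array_mttf(disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of an array without redundancy.

    Assumes independent disks with identical, constant failure rates,
    so the array fails as soon as any one disk fails.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return disk_mttf_hours / num_disks


# Hypothetical per-disk MTTF of 30,000 hours.
for n in (1, 10, 100):
    print(f"{n:>3} disks: {array_mttf(30_000, n):,.0f} hours")
```

Under these assumptions a tenfold increase in the number of disks reduces the MTTF of the array by a factor of ten, which is why redundancy is needed.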
RAID level 1 utilizes complete duplication of data (sometimes called "mirroring") and so has a relatively small performance per disk ratio. RAID level 2 improves this performance as well as the capacity per disk ratio by utilizing error correction codes that enable a reduction of the number of extra disks needed to provide error correction and disk failure recovery. In RAID level 2, data is interleaved onto a group of G data disks, and error correction codes (ECC) are generated and stored onto an additional set of C disks referred to as "check disks" to detect and correct a single error. The ECC are used to detect and enable correction of random single bit errors in data and also enable recovery of data if one of the G data disks crashes. Since only G of the C+G disks carry user data, the performance per disk is proportional to G/(G+C). G/C is typically significantly greater than 1, so RAID level 2 exhibits an improvement in performance per disk over RAID level 1. One or more spare disks can be included in the system so that if one of the disk drives fails, the spare disk can be electronically switched into the RAID to replace the failed disk drive.
RAID level 3 is a variant of RAID level 2 in which the error detecting capabilities that are provided by most existing inexpensive disk drives are utilized to enable the number of check disks to be reduced to one, thereby increasing the relative performance per disk over that of RAID level 2. Typically parity data is substituted for ECC. Either ECC, some other error code, or parity data may be termed redundant data. For both RAID levels 2 and 3 the transaction time for disk accesses for large or grouped data is reduced because bandwidth into all of the data disks can be exploited.
The performance for small data transfers, such as are common in transaction processing, is known to be poor for RAID levels 1-3 because data is interleaved among the disks in bit-sized or byte-sized blocks, such that even for a data access of less than one sector of data, all disks must be accessed. To improve this performance parameter, in RAID level 4, a variant of RAID level 3, data is interleaved onto the disks in sector interleave mode instead of in bit or byte interleave mode as in levels 1-3. In other words, individual I/O transfers involve only a single data disk. The benefit of this arises from the potential for parallelism of the input/output operations. This reduces the amount of competition among separate data access requests to access the same data disk at the same time.
Nonetheless, the performance of RAID level 4 remains limited because of access contention for the check disk during write operations. For all write operations, the check disk must be accessed in order to store updated parity data on the check disk for each stripe (i.e., row of sectors) of data into which data is written. Patterson et al. observed that in RAID level 4 and level 5, an individual write to a single sector does not involve all of the disks in a logical mass storage device since the parity bit on the check disk is just a single exclusive OR of all the corresponding data bits in a group. In RAID level 4, write operations always involve reading and rewriting the parity disk, making the parity disk the bottleneck in access to the array for concurrent write operations. RAID level 5, a variant of RAID level 4, mitigates the contention problem on write operations by distributing the parity check data and user data across all disks. For RAID level 4, large write operations (those extending to all of a parity stripe unit) do not require a preliminary read.
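The observation that parity is a single exclusive OR of the corresponding data bits can be sketched as follows. This is a minimal illustration, not an implementation from the Patterson et al. reference; the helper names `xor_blocks` and `reconstruct` and the sample byte values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive OR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of corresponding bytes per position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe of three data blocks, two bytes each.
stripe = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(stripe)

# Simulate the loss of the second data disk and rebuild its block.
recovered = reconstruct([stripe[0], stripe[2]], parity)
print(recovered == stripe[1])
```

Because exclusive OR is its own inverse, the same routine that generates parity also regenerates a lost block.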
However, contention issues still arise. Both RAID level 4 and 5 have required, with each read-modify-write operation (e.g., an update of a record), two accesses to each of two disks. An update involves a read of the existing user data on a data disk and a read of parity data for the stripe to which the user data belongs on a parity disk. This is followed by writes to both disks of the updated user data and parity data respectively. The read operations are prerequisite to calculating updated parity, which is done using the following function:
new parity = (old data XOR new data) XOR old parity.
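The parity update function can be checked against a full recomputation of parity over the stripe. The following is a minimal sketch under assumed values; the names `xor_bytes` and `updated_parity` and the sample blocks are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """new parity = (old data XOR new data) XOR old parity."""
    return xor_bytes(xor_bytes(old_data, new_data), old_parity)


# Hypothetical stripe of three data blocks and its parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
old_parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Read-modify-write of the second block: two reads, then two writes.
new_block = b"\xff\x00"
new_parity = updated_parity(stripe[1], new_block, old_parity)
stripe[1] = new_block

# The shortcut agrees with recomputing parity over the whole stripe.
full_parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])
print(new_parity == full_parity)
```

Only the data disk being updated and the parity disk are touched, which is why the update costs two accesses to each of two disks regardless of the width of the stripe.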
To prevent loss of coherence of parity data, processing of a data update operation on RAID level 4 and 5 mass storage systems has required atomic, or serialized, read-modify-write operations during which the drive storing parity data has been locked to prevent change of the parity information for another data update operation before the first update is complete. Coherence of parity means that parity continues to represent the equivalent of a series of exclusive OR operations performed sequentially on the data of the parity group. Drive locking prevents command queuing in disk subsystems which support Tagged Command Queuing (TCQ).
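The need for serialization can be illustrated by comparing two updates to the same stripe that are applied atomically with two updates whose reads both occur before either write. This is a simplified model under assumed values, not a description of any particular controller; all function names and data values are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Parity of a stripe: sequential exclusive OR of all data blocks."""
    result = bytes(len(blocks[0]))
    for block in blocks:
        result = xor_bytes(result, block)
    return result


def serialized_updates(data, parity, updates):
    """Each (disk, new_block) update completes before the next one starts."""
    data = list(data)
    for disk, new_block in updates:
        old_data, old_parity = data[disk], parity  # read phase
        parity = xor_bytes(xor_bytes(old_data, new_block), old_parity)
        data[disk] = new_block  # write phase
    return data, parity


def interleaved_updates(data, parity, updates):
    """All updates read the same old parity; the last parity write wins."""
    data = list(data)
    reads = [(disk, new, data[disk], parity) for disk, new in updates]
    for disk, new_block, old_data, old_parity in reads:
        data[disk] = new_block
        parity = xor_bytes(xor_bytes(old_data, new_block), old_parity)
    return data, parity


data = [b"\x01", b"\x02", b"\x04"]
parity = parity_of(data)
updates = [(0, b"\x10"), (1, b"\x20")]

s_data, s_parity = serialized_updates(data, parity, updates)
i_data, i_parity = interleaved_updates(data, parity, updates)
print("serialized coherent: ", parity_of(s_data) == s_parity)
print("interleaved coherent:", parity_of(i_data) == i_parity)
```

In the interleaved case the second parity write is computed from stale parity and discards the contribution of the first update, so the stored parity no longer matches the data of the parity group.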
Tagged Command Queuing is defined in the standards for the Small Computer Systems Interface (SCSI). It allows a host to send multiple commands to a drive without waiting for responses. Commands and responses are tagged, allowing the host to match each response to its request. In some systems, the execution order of the operations may be optimized to improve drive performance. Linked commands are provided to ensure execution of commands in a predetermined order if required. Serializing access to the drive prevents command queuing; consequently, the disk subsystem controller cannot optimize the operation sequence, and performance of the disk subsystem suffers greatly.
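The role of tags in matching responses to requests can be sketched with a toy model. The sketch below is not the SCSI command set or any real driver interface; the classes `Command` and `QueuingDrive`, the function `run_host`, and the choice to complete commands in ascending block address order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    tag: int   # identifier chosen by the host
    op: str    # "read" or "write"
    lba: int   # logical block address


class QueuingDrive:
    """Toy drive that queues commands and completes them in ascending LBA order."""

    def __init__(self):
        self._queue = []

    def submit(self, command: Command) -> None:
        self._queue.append(command)

    def complete_all(self):
        # The drive, not the host, chooses the execution order.
        done = sorted(self._queue, key=lambda c: c.lba)
        self._queue.clear()
        return [(c.tag, f"{c.op} lba={c.lba} ok") for c in done]


def run_host(requests):
    """Issue every request without waiting, then match responses by tag."""
    drive = QueuingDrive()
    pending = {}
    for tag, (op, lba) in enumerate(requests):
        command = Command(tag, op, lba)
        pending[tag] = command
        drive.submit(command)
    results = {}
    for tag, status in drive.complete_all():
        results[tag] = (pending.pop(tag), status)
    return results


results = run_host([("read", 900), ("write", 10), ("read", 500)])
for tag, (command, status) in results.items():
    print(tag, status)
```

The responses arrive in an order different from the order of issue, and the tag is what lets the host pair each one with its original request. A locked drive receives one command at a time and so has nothing to reorder.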
The term "striping" is often seen in reference to the RAID art. Striping is data interleaving over a plurality of disk drives by "stripe units." A stripe unit is a group of logically contiguous data that is written physically consecutively on a single disk before data is placed on a different disk. A data stripe comprises a logical collection of stripe units.
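The mapping from a logical block to a stripe unit can be sketched as follows. The layout shown is one possible arrangement assumed for illustration: a dedicated last disk holds parity in the RAID level 4 case, and parity rotates by one disk per stripe in the RAID level 5 case. The function name `locate_block` and its parameters are hypothetical.

```python
def locate_block(logical_block, num_disks, blocks_per_unit, rotate_parity):
    """Return (data_disk, parity_disk, stripe, offset_in_unit) for a logical block.

    rotate_parity=False models a dedicated parity disk (RAID level 4);
    rotate_parity=True models an assumed rotating layout (RAID level 5).
    """
    if num_disks < 2:
        raise ValueError("at least one data disk and one parity disk are required")
    data_disks = num_disks - 1
    # Consecutive blocks fill one stripe unit before moving to the next disk.
    unit_index, offset = divmod(logical_block, blocks_per_unit)
    stripe, unit_in_stripe = divmod(unit_index, data_disks)
    if rotate_parity:
        parity_disk = (num_disks - 1 - stripe) % num_disks
    else:
        parity_disk = num_disks - 1
    # Data units occupy the disks in order, skipping the parity disk.
    data_disk = unit_in_stripe if unit_in_stripe < parity_disk else unit_in_stripe + 1
    return data_disk, parity_disk, stripe, offset


# Four disks, two blocks per stripe unit.
for block in (0, 7, 11):
    print(block,
          locate_block(block, 4, 2, rotate_parity=False),
          locate_block(block, 4, 2, rotate_parity=True))
```

With the dedicated layout every stripe sends its parity writes to the same disk, while the rotating layout spreads them over all disks.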