The present invention relates to sets of physical mass storage devices that collectively perform as one or more logical mass storage devices. In particular, the present invention relates to methods and apparatus for maintaining data integrity across such a set of physical mass storage devices.
Use of disk memory continues to be important in computers because it is nonvolatile and because memory size demands continue to outpace practical amounts of main memory. At this time, disks are slower than main memory so that system performance is often limited by disk access speed. Therefore, it is important for overall system performance to improve both memory size and data access speed of disk drive units. For a discussion of this, see Michelle Y. Kim, "Synchronized Disk Interleaving", IEEE Transactions On Computers, Vol. C-35, No. 11, November 1986.
Disk memory size can be increased by increasing the number of disks and/or increasing the diameters of the disks, but this does not increase data access speed. Memory size and data transfer rate can both be increased by increasing the density of data storage. However, technological constraints limit data density and high density disks are more prone to errors.
A variety of techniques have been utilized to improve data access speed. Disk cache memory capable of holding an entire track of data has been used to eliminate seek and rotation delays for successive accesses to data on a single track. Multiple read/write heads have been used to interleave blocks of data on a set of disks or on a set of tracks on a single disk. Common data block sizes are byte size, word size, and sector size. Disk interleaving is a known supercomputer technique for increasing performance, and is discussed, for example, in the above-noted article.
Data access performance can be measured by a number of parameters, depending on the relevant application. In transaction processing (such as in banking) data transfers are typically small and request rates are high and random. In supercomputer applications, on the other hand, transfers of large data blocks are common.
A recently developed disk memory structure with improved performance at relatively low cost is the Redundant Array of Inexpensive Disks (RAID) (see, for example, David A. Patterson, et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Report No. UCB/CSD 87/39, December, 1987, Computer Science Division (EECS), University of California, Berkeley, Calif. 94720). As discussed in the Patterson et al. reference, the large personal computer market has supported the development of inexpensive disk drives having a better ratio of performance to cost than Single Large Expensive Disk (SLED) systems such as the IBM 3380. The number of I/Os per second per read/write head in an inexpensive disk is within a factor of two of the large disks. Therefore, the parallel transfer from several inexpensive disks in a RAID architecture, in which a set of inexpensive disks functions as a single logical disk drive, produces greater performance than a SLED at a reduced price.
Unfortunately, when data is stored on more than one disk, the mean time to failure varies inversely with the number of disks in the array. To correct for this decreased mean time to failure of the system, error recognition and correction is built into the RAID systems. The Patterson et al. reference discusses 5 RAID embodiments each having a different means for error recognition and correction. These RAID embodiments are referred to as RAID levels 1-5.
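The inverse relationship noted above can be illustrated with a minimal sketch. It assumes independent drives with a constant failure rate and no redundancy; the function name and the sample figures are hypothetical and are not taken from the reference.

```python
def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """Approximate mean time to failure of an array with no redundancy.

    Assumption: drive failures are independent and occur at a constant
    rate, so the array fails as soon as any one drive fails.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return disk_mttf_hours / num_disks


# Illustrative figures only: one drive versus an array of 100 such drives.
single = array_mttf_hours(30_000, 1)
array = array_mttf_hours(30_000, 100)
```

Under these assumptions, multiplying the number of drives by one hundred divides the expected time to the first drive failure by one hundred, which is why redundancy is added.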
RAID level 1 utilizes complete duplication of data and so has a relatively small performance per disk ratio. RAID level 2 improves this performance as well as the capacity per disk ratio by utilizing error correction codes that enable a reduction of the number of extra disks needed to provide error correction and disk failure recovery. In RAID level 2, data is interleaved onto a group of G data disks and error codes are generated and stored onto an additional set of C disks referred to as "check disks" to detect and correct a single error. This error code detects and enables correction of random single bit errors in data and also enables recovery of data if one of the G data disks crashes. Since only G of the G+C disks carry user data, the performance per disk is proportional to G/(G+C). G/C is typically significantly greater than 1, so RAID level 2 exhibits an improvement in performance per disk over RAID level 1. One or more spare disks can be included in the system so that if one of the disk drives fails, the spare disk can be electronically switched into the RAID to replace the failed disk drive.
RAID level 3 is a variant of RAID level 2 in which the error detecting capabilities that are provided by most existing inexpensive disk drives are utilized to enable the number of check disks to be reduced to one, thereby increasing the relative performance per disk over that of RAID level 2.
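The ratio G/(G+C) discussed for RAID levels 1 through 3 can be compared with a short sketch. The group sizes used here (ten data disks, four check disks for level 2) are illustrative assumptions, not values stated in this document.

```python
from fractions import Fraction


def useful_fraction(data_disks: int, check_disks: int) -> Fraction:
    """Fraction of the disks in a group that carry user data: G / (G + C)."""
    if data_disks < 1 or check_disks < 0:
        raise ValueError("need at least one data disk and no negative counts")
    return Fraction(data_disks, data_disks + check_disks)


# Hypothetical group sizes chosen only to show the trend.
level_1 = useful_fraction(1, 1)    # full duplication: one copy per data disk
level_2 = useful_fraction(10, 4)   # several check disks per group
level_3 = useful_fraction(10, 1)   # a single check disk per group
```

With these example sizes the fraction rises from one half at level 1, to five sevenths at level 2, to ten elevenths at level 3.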
Performance for small data transfers, such as are common in transaction processing, is known to be poor for RAID levels 1-3 because data is interleaved among the disks in bit-sized blocks, such that even for a data access of less than one sector of data, all disks must be accessed. To improve this performance parameter, in RAID level 4, a variant of RAID level 3, data is interleaved onto the disks in sector interleave mode instead of in bit interleave mode as in levels 1-3. The benefit of this is that, for small data accesses (i.e., accesses smaller than G+C sectors of data), all disks need not be accessed. That is, for a data access size between k and k+1 sectors of data, only k+1 data disks need be accessed. This reduces the amount of competition among separate data access requests to access the same data disk at the same time.
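The effect of sector interleaving on the number of disks touched can be sketched as follows. The round-robin mapping of logical sector s to data disk s mod G is an assumption made for illustration, and the function name is hypothetical.

```python
def data_disks_touched(first_sector: int, sector_count: int, data_disks: int) -> set:
    """Return the set of data disks holding the requested logical sectors.

    Assumption: logical sector s is stored on data disk (s mod G), i.e.
    consecutive sectors are placed on consecutive data disks.
    """
    if sector_count < 1 or data_disks < 1:
        raise ValueError("sector_count and data_disks must be positive")
    return {s % data_disks for s in range(first_sector, first_sector + sector_count)}


small = data_disks_touched(5, 1, 4)    # a one-sector access uses one disk
medium = data_disks_touched(2, 3, 4)   # three sectors use three disks
large = data_disks_touched(0, 10, 4)   # a long access uses every data disk
```

Under this mapping, two small requests that land on different data disks can proceed without competing for the same drive, which is not possible when every access spans all disks.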
Yet the performance of RAID level 4 remains limited because of access contention for the check disk during write operations. For all write operations, the check disk must be accessed in order to store updated parity data on the check disk for each stripe (i.e., row of sectors) of data into which data is written. Therefore, write operations interfere with each other, even for small data accesses. RAID level 5, a variant of RAID level 4, avoids this contention problem on write operations by distributing the parity check data and user data across all disks.
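The parity update required on each write, and the difference between a fixed and a distributed parity location, can be sketched as follows. The sketch assumes simple exclusive-OR parity over the sectors of a stripe; the particular rotation used for level 5 is one possible layout chosen for illustration, and all names are hypothetical.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise exclusive OR of equally sized blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Parity after one sector of a stripe changes from old_data to new_data."""
    return xor_blocks(old_parity, old_data, new_data)


def parity_disk(stripe: int, num_disks: int, level: int) -> int:
    """Index of the disk holding parity for a stripe (illustrative layouts)."""
    if level == 4:
        return num_disks - 1                      # one dedicated check disk
    if level == 5:
        return (num_disks - 1 - stripe) % num_disks  # assumed rotation
    raise ValueError("only levels 4 and 5 are modeled")
```

In the level 4 layout every write, whatever stripe it addresses, must update the same disk, whereas in the assumed level 5 rotation writes to different stripes update parity on different disks.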
Errors in data in a RAID architecture, such as those resulting from hardware failure, can manifest themselves in several ways. First, data within a data block may be corrupted during a read or write operation. Such a failure to correctly write or read data on the disk is normally detected by a check of parity, Error Correction Codes (ECC) and/or Cyclic Redundancy Check (CRC) codes that are generated at the time the data is stored and that are checked each time the data is written or read. This type of check is limited to validating the data path within the disk drive.
Other potential errors in data, however, require additional error detection capability. For example, during a write operation, a drive can fail to write any data at all. In this case, in a RAID 4 or 5 architecture, a readback of the data (including a check of any parity, ECC or CRC codes) would not detect that old data is being accessed in place of the data intended to be accessed. Small disk drives often do not include special logic to detect a failure to write any data.
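The reason a readback does not reveal a write that never occurred can be shown with a small simulation. The `SimulatedDrive` class and its `drop` flag are hypothetical constructs for illustration only; a CRC-32 stored alongside each sector stands in for whatever parity, ECC or CRC code the drive uses.

```python
import zlib


class SimulatedDrive:
    """Toy drive that stores each sector together with a CRC of its data."""

    def __init__(self) -> None:
        self.sectors = {}

    def write(self, sector: int, data: bytes, drop: bool = False) -> None:
        # drop=True models a drive that silently fails to write anything.
        if drop:
            return
        self.sectors[sector] = (data, zlib.crc32(data))

    def read(self, sector: int) -> bytes:
        data, stored_crc = self.sectors[sector]
        if zlib.crc32(data) != stored_crc:
            raise IOError("CRC mismatch")
        return data


drive = SimulatedDrive()
drive.write(7, b"old contents")
drive.write(7, b"new contents", drop=True)  # the write is lost
readback = drive.read(7)                    # passes the CRC check
```

The readback succeeds because the old data and its old check code are still mutually consistent, so the check code alone cannot distinguish stale data from the data that was meant to be written.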
Also, a hardware failure may result in data being written to or read from the wrong disk or wrong sector within a disk due to misrouting of data within the controller for the RAID. On a subsequent readback of the data in the RAID 4 or 5 architectures, again no problem would be detected. Although it is known to add a logical block address type field to data when writing the data to a disk drive, this does not provide assurance that the data block at that address is valid if in the array more than one physical location may have the same logical block address. This condition may exist, for example, in an array operated as a plurality of logical units. Although, as noted elsewhere herein, previously known RAID arrays operate only as a single logical unit, a novel method for operating a set of physical mass storage devices (e.g., a RAID system) as a plurality of logical units is also referred to herein, and pursuant to this novel method the set may include more than one physical location with the same logical block address.
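The ambiguity of a logical block address field when two logical units share the same address can be sketched as follows. The mapping table, the `Block` record and both functions are hypothetical and are intended only to illustrate the condition described above.

```python
from typing import Dict, NamedTuple, Tuple


class Block(NamedTuple):
    lba: int        # logical block address field stored with the data
    payload: bytes


# Physical storage keyed by (disk, sector).
store: Dict[Tuple[int, int], Block] = {}

# Assumed mapping: two logical units each have a block with address 12.
LOCATION = {(0, 12): (0, 12), (1, 12): (3, 12)}


def write_block(unit: int, lba: int, payload: bytes) -> None:
    store[LOCATION[(unit, lba)]] = Block(lba, payload)


def read_block(physical: Tuple[int, int], expected_lba: int) -> bytes:
    block = store[physical]
    if block.lba != expected_lba:
        raise IOError("logical block address mismatch")
    return block.payload


write_block(0, 12, b"belongs to unit 0")
write_block(1, 12, b"belongs to unit 1")

# A read intended for unit 0 is misrouted to unit 1's physical location.
misrouted = read_block((3, 12), expected_lba=12)
```

The address check passes even though the wrong unit's data is returned, because both physical locations carry the same logical block address.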
In view of the foregoing, it would be desirable to be able to provide a way to detect and, where possible, correct data errors resulting from misrouting of data within a data storage system comprising a set of physical mass storage devices.
It would also be desirable to be able to provide a way to detect and, where possible, correct data errors resulting from a failure to write on one or more devices while performing write requests in such a data storage system.