This invention relates to mass information storage systems used in high-performance computer systems. More particularly, this invention relates to new and improved data structures used for containing and storing sequence number metadata and revision number metadata for use in a data integrity-assuring technique to detect and correct data path errors or drive data corruption errors which may inadvertently occur during the transfer of data to, or the retrieval of data from, storage media, such as a redundant array of independent disks (RAID) mass storage system.
In high-performance computer systems, a mass storage system must be capable of rapidly supplying information to each processor of the computer system in a "read" operation, rapidly transferring information for storage in a "write" operation, and performing both read and write operations with a high level of integrity so that the information is not corrupted or lost. Incorrect, corrupted or lost information defeats or undermines the effectiveness of the computer system. The reliability of the information is an absolute necessity in most if not all business computing operations.
A variety of high-performance mass storage systems have been developed to assure rapid information storage and retrieval operations. The information storage and retrieval operations are generally the slowest operations performed by a computer system; consequently, the information storage and retrieval operations limit the speed and functionality of the computer system itself.
One popular mass storage system which offers relatively rapid information storage and retrieval capabilities at moderate cost, as well as the capability for assuring a relatively high integrity of the information against corruption and loss, is a redundant array of independent or inexpensive disks (RAID) mass storage system. In general, a RAID mass storage system utilizes a relatively large number of individual, inexpensive disk drives which are controlled separately and simultaneously. The information to be written is separated into smaller components and recorded simultaneously or nearly simultaneously on multiple ones of the disk drives. The information to be read is retrieved almost simultaneously in the smaller components from the multiplicity of disk drives and then assembled into a larger total collection of information requested. By separating the total information into smaller components, the time consumed to perform reading and writing operations is reduced. On the other hand, one inherent aspect of the complexity and speed of the read and write operations in a RAID mass storage system is an increasing risk of inadvertent information corruption and data loss arising from the number of disk drives and the number and complexity of the input/output (I/O) operations involved.
Various error correction and integrity-assuring software techniques have been developed to assure that inadvertent errors can be detected and that the corrupted information can be corrected. The importance of such integrity-assuring techniques increases with higher performance mass storage systems, because the complexity of the higher performance techniques usually involves an inherent increased risk of inadvertent errors. Some of these integrity-assuring techniques involve the use of separate software which is executed concurrently with the information storage and retrieval operations, to check and assure the integrity of the storage and retrieval operations. The use of such separate software imposes a performance penalty on the overall functionality of the computer system, because the concurrent execution of the integrity-assuring software consumes computer resources which could otherwise be utilized for processing, reading or writing the information. Another type of integrity-assuring technique involves attaching certain limited metadata to the data to be written, but then requiring a sequence of separate read and write operations involving both the new data and the old data. The number of I/O operations involved diminishes performance of the computer system. Therefore, it is important that any integrity-assuring software impose only a small performance degradation on the computer system. Otherwise the advantages of the higher performance mass storage and computing system will be lost or diminished.
Although the integrity-assuring software techniques used in most mass storage systems are reliable, there are a few classes of hardware errors which seem to arise inadvertently and which are extremely difficult to detect or correct on a basis which does not impose a performance degradation. These types of errors seem prone to occur to the disk drives, almost inexplicably. One example of this type of an error involves the disk drive accepting information in a write request and acknowledging that the information has been correctly written, without actually writing the information to the storage media. Another example involves the disk drive returning information in response to a read request that is from an incorrect disk memory location. A further example involves the disk drive writing information to the wrong address location. These types of errors are known as "silent" errors, and are so designated because of the apparent, but nevertheless incorrect, accuracy of the operations performed.
The occurrence of silent errors is extremely rare. However, such errors must be detected and/or corrected in computer systems where absolute reliability of the information is required. Because of the extremely infrequent occurrence of such silent errors, it is not advantageous to concurrently operate any integrity-assuring software or technique that imposes a continuous and significant penalty of performance degradation on the normal, error-free operations of the computer system.
Apart from silent errors, there are other situations in which inconsistencies between data and parity are detected due to incomplete write operations, failed disk input/output (I/O) operations or other general firmware and hardware failures. In such circumstances, it is desirable to utilize a technique to make determinations of consistency in the data and parity. Parity is additional information that is stored along with the data that defines the data and allows for reconstruction of the data. By knowing either the correct data or the correct parity, it is possible to correctly regenerate the correct version of incorrect data or parity. While a variety of integrity-assuring software techniques are available to regenerate the correct data or the correct parity, it is desirable to avoid the performance degradation penalty incurred by continually executing separate software to continuously check data and parity.
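By way of illustration only, the regeneration of data from parity described above may be sketched as follows. The sketch assumes bytewise exclusive-OR (XOR) parity, which is common in RAID systems but is not specified in the text above; the function names xor_blocks and rebuild_missing are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Regenerate one lost data block from the surviving blocks and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Three data blocks as they might be striped across three drives.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Suppose the second block is lost or known to be incorrect.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Under the XOR assumption the same operation also regenerates the parity from the data, so either the correct data or the correct parity suffices to restore the other.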
It is with respect to these and other background considerations that the present invention has evolved.
The present invention involves two improved data structures used for containing a sequence number and a revision number created as metadata and for storing the sequence number and revision number metadata with the user data itself in a group of data structures of a mass storage system, such as a redundant array of independent disks (RAID) mass storage system. The data structures of this invention promote the use of the sequence number and the revision number in an effective way, which does not impose a significant performance degradation penalty or storage capacity penalty on the computer system or the mass storage system, and which detects and typically corrects data path and corruption errors, as well as data and parity errors.
One aspect of the present invention pertains to a group of data structures defining an input/output (I/O) operation performed on storage media and used to contain user data and to contain metadata which is used to detect errors arising from other I/O operations performed on one or more of the data structures of the group. The group of data structures includes a plurality of user data structures and a parity data structure associated with the user data structures. Each user data structure has a plurality of fields including a user data field for containing the user data, a sequence number field for containing sequence number information identifying the I/O operation, such as in a full stripe write operation, which originated the group of data structures, and a revision number field for containing revision number information identifying a subsequent I/O operation, such as a read modify write (RMW), performed on the user data in the user data field of a user data structure. The associated parity data structure has a plurality of fields including a parity field for containing parity information describing the collection of user data in the user data fields of each of the associated user data structures, a sequence number field for containing sequence number information identifying the I/O operation which originated the group of structures, and a revision number field for each associated user data structure which contains revision number information identifying the subsequent I/O operation in which the user data was written in the user data field of the associated user data structure. The parity data structure and the associated plurality of user data structures collectively and logically constitute the separate unit of I/O information recorded on the storage media, such as a full stripe.
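A minimal sketch of the group of data structures described above is given below for illustration. The class names, field names, the XOR parity function and the initial revision number of zero are assumptions made for the sketch and are not taken from the text.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserDataStructure:
    user_data: bytes          # user data field
    sequence_number: int      # identifies the originating full stripe write
    revision_number: int = 0  # identifies a later write such as an RMW (0 assumed initially)

@dataclass
class ParityDataStructure:
    parity: bytes             # parity over the user data fields
    sequence_number: int      # same originating I/O operation as the user structures
    revision_numbers: List[int] = field(default_factory=list)  # one per user structure

def xor_parity(chunks):
    """Assumed parity function: bytewise XOR of equal-length chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def full_stripe_write(chunks, sequence_number):
    """Build the user data structures and the parity structure of one stripe."""
    users = [UserDataStructure(c, sequence_number) for c in chunks]
    parity = ParityDataStructure(
        xor_parity(chunks), sequence_number, [0] * len(chunks)
    )
    return users, parity

users, parity = full_stripe_write([b"ab", b"cd"], sequence_number=7)
```

In this sketch the parity structure carries one revision number per user data structure, which mirrors the description above in which the parity structure has a revision number field for each associated user data structure.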
Errors arising from I/O operations are detected and corrected using the user data structures and the parity data structure. The sequence numbers are read from one user data structure and from the parity data structure during a subsequent read operation, and a determination is made of whether the sequence numbers match. If the sequence numbers do not match, a sequence number from another user data structure is read, and a correct sequence number is determined as that sequence number which is equal to the two matching ones of the three sequence numbers. Any detected errors in the user data structure are corrected by using the user data from the user data structures having the correct sequence number and the parity information from the parity data structure to construct the correct user data and metadata information to replace the incorrect user data structure. Any detected sequence number error in the parity data structure results in correcting the parity data structure by using the user data and metadata from the user data structures to construct the correct metadata information in the parity data structure.
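The sequence number comparison described above may be sketched as follows, purely as an illustration. The function name resolve_sequence_number, the callback used to read a second user data structure only when needed, and the returned labels are hypothetical.

```python
def resolve_sequence_number(user_seq, parity_seq, read_other_user_seq):
    """Return (correct_sequence_number, structure_in_error).

    structure_in_error is None when the two numbers match, "user" when the
    first user data structure is wrong, and "parity" when the parity data
    structure is wrong. read_other_user_seq is called only on a mismatch.
    """
    if user_seq == parity_seq:
        return user_seq, None
    other_seq = read_other_user_seq()
    if other_seq == user_seq:
        return user_seq, "parity"
    if other_seq == parity_seq:
        return parity_seq, "user"
    # No two of the three numbers agree; this case is outside the sketch.
    raise ValueError("no two sequence numbers agree")

# Normal case: no extra read is performed.
ok = resolve_sequence_number(5, 5, lambda: 5)
# The parity structure holds a stale sequence number.
bad_parity = resolve_sequence_number(5, 4, lambda: 5)
# The user data structure holds a stale sequence number.
bad_user = resolve_sequence_number(4, 5, lambda: 5)
```

Reading the third sequence number lazily keeps the error-free path at two comparisons, which is consistent with the stated aim of performing the check as an adjunct to an ordinary read operation.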
To detect and correct errors arising from read modify write (RMW) operations, the revision number is read from the user data structure and from the parity data structure. When the revision numbers do not match, an error is indicated. The revision number which is indicative of a subsequent I/O operation is regarded as the correct revision number, accounting for any wrapping of the value of the revision number due to an overflow of bits in the available field size for the revision number. If the revision number from the parity data structure is incorrect, the correct metadata and parity information is constructed from the user data and metadata read from the user data structures. If the revision number read from one of the user data structures is incorrect, the correct user data and metadata for the incorrect user data structure is constructed from the user data and metadata read from the other user data structures and from the information read from the parity data structure.
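One possible way to compare revision numbers while accounting for wrapping is sketched below. The 8-bit field width, the half-range rule used to decide which value is later, and the names is_later and check_revision are assumptions made for illustration; the text above does not specify the field size or the comparison rule.

```python
REV_BITS = 8                 # assumed width of the revision number field
REV_MOD = 1 << REV_BITS      # values wrap from 255 back to 0

def is_later(a, b):
    """True if revision a is later than revision b, allowing for wraparound.

    a is treated as later when it is ahead of b by less than half the range.
    """
    diff = (a - b) % REV_MOD
    return 0 < diff < REV_MOD // 2

def check_revision(user_rev, parity_rev):
    """Return (correct_revision, structure_in_error)."""
    if user_rev == parity_rev:
        return user_rev, None
    if is_later(user_rev, parity_rev):
        return user_rev, "parity"   # parity structure missed the later write
    return parity_rev, "user"       # user data structure missed the later write

match = check_revision(3, 3)
stale_parity = check_revision(4, 3)
stale_user_after_wrap = check_revision(255, 0)   # 0 follows 255 after wrapping
```

The half-range rule assumes the two copies never drift apart by half the range or more; a real implementation would choose the field width with that bound in mind.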
Silent errors arising from drive data and data path corruption of some of the user data are detected by use of the sequence number metadata contained in the user data and parity structures. Silent errors arising from corruption during a RMW operation are detected by use of the revision number metadata contained in the user data and parity structures. The parity information contained in the parity data structure is used in conjunction with the user data and metadata contained in the user data structures to correct the corrupted information. The corrupted information is corrected and replaced as an adjunct of the normally-occurring read operations.
Other preferable aspects of the data structures include a plurality of separate user data fields, where each separate user data field contains user data in the user data structure, and a plurality of separate parity fields in the parity data structure, where each separate parity field correlates to a separate user data field of each user data structure. Each separate parity field is for containing parity information describing the user data contained in the correlated user data field. Each user data structure further preferably includes a sequence number field correlated with the separate user data fields of that user data structure, and the sequence number field is for containing sequence number information of the user data contained in the correlated separate user data field. Each user data structure further preferably includes a revision number field correlated with the separate user data fields of that user data structure, and the revision number field is for containing revision number information describing the user data in the correlated separate user data field. The revision number information describes user data which was written in a subsequent I/O operation in which the user data in one user data structure in a stripe is written apart from writing the user data in the other user data structures. Each user data structure further preferably includes a code field correlated with each separate user data field for containing error detecting code (e.g. CRC) information describing the user data contained in the correlated separate user data field. 
Further still, each separate user data field of each user data structure further includes a plurality of divisions into separate user data field subdivisions, and each separate parity field of the parity data structure further includes a similar plurality of divisions into separate parity field subdivisions, each of which correlates to a separate user data field subdivision of the plurality of associated user data structures. Further still, each separate parity field subdivision is for containing parity information describing the user data contained in the correlated separate user data field subdivision. The parity data structure further preferably includes a code field correlated with each separate parity field subdivision for containing error detecting code (e.g. CRC) information describing the parity information contained in the correlated separate parity field subdivision.
Another aspect of the invention relates to a layout of the user data structures and the parity data structure in an information storage media which is divided into sectors of uniform length. Each user data structure includes a user data region which includes each separate user data field, and the user data region consumes a plurality of contiguous sectors on the storage media. Each user data structure further includes a metadata region in which each separate sequence number field and each separate revision number field, among other descriptive information, is located. The metadata region of each user data structure consumes one sector on the storage media adjacent to the contiguous sectors occupied by the user data region.
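The sector arithmetic implied by this layout may be sketched as follows. The 512-byte sector size, the placement of the metadata sector immediately after the user data region (the text says only that it is adjacent), and the function name layout_user_structure are assumptions made for illustration.

```python
SECTOR_SIZE = 512  # assumed uniform sector length in bytes

def layout_user_structure(start_sector, user_data_bytes):
    """Return the sector positions of one user data structure.

    The user data region occupies contiguous sectors beginning at
    start_sector, and the metadata region occupies the single sector
    that follows (placement after the data is an assumption).
    """
    data_sectors = -(-user_data_bytes // SECTOR_SIZE)  # ceiling division
    metadata_sector = start_sector + data_sectors
    return {
        "data_first": start_sector,
        "data_last": metadata_sector - 1,
        "metadata": metadata_sector,
        "total_sectors": data_sectors + 1,
    }

# A 4096-byte user data region starting at sector 100.
example = layout_user_structure(100, 4096)
```

With these assumed sizes the metadata occupies one sector out of nine, which illustrates why a larger user data region reduces the proportion of media space given over to metadata.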
Another layout of the user data structures and the parity data structure in an information storage media which is divided into sectors of uniform length involves each separate user data field consuming a plurality of logically contiguous sectors on the storage media. The metadata region of each user data structure, in which each separate sequence number field and each separate revision number field is located, is divided into separate metadata regions which correlate to each separate user data region, and each separate metadata region has a size equal to a fractional portion of a sector. In one case, each separate metadata region is located in one sector on the storage media logically contiguous with the plurality of sectors of the correlated user data field, and another plurality of sectors occupied by another separate user data field is located logically contiguous with the separate metadata region correlated with that other user data field. In another case, each separate metadata region is located in one sector on the storage media logically contiguous with the plurality of sectors occupied by one correlated user data field, the remaining portion of the sector not occupied by the metadata region is unused, and another plurality of sectors occupied by another user data field is located logically contiguous with the metadata region correlated with the one separate user data field.
Another aspect of this invention relates to a method of containing user data and metadata information in the user data structures and the associated parity structure, by writing user data in the user data fields of the user data structure, writing sequence number information in the sequence number fields of the user data structures and the parity data structure, writing revision number information in the revision number fields of the user data structures and the parity data structure, and writing parity information in the parity fields of the parity data structure.
Organizing the user data structures and the associated parity data structure in the manner described facilitates the efficient use of readily-available information to detect and correct errors arising from I/O operations, as an adjunct to the commonly-performed read operations. For redundancy, the user data structures and the parity data structures of the full stripe write must be on separate units of the storage media, such as on separate disk drives of a single redundancy group of a RAID mass storage system. Laying out the user data and parity structures to accommodate standard sector sizes of storage media facilitates the efficient utilization of the storage media, since only a small amount of the media space is not occupied by useful user data and parity information.
A more complete appreciation of the present invention and its scope, and the manner in which it achieves the above noted improvements, can be obtained by reference to the following detailed description of presently preferred embodiments of the invention taken in connection with the accompanying drawings, which are briefly summarized below, and the appended claims.