1. Field of the Invention
This invention relates generally to error detection techniques for storage access operations in multi-processor computer systems. More specifically, it relates to the generation and validation of error detection codes embedded in portions of data stored in a file cache system.
2. Background Information
In modern computer systems, data is continuously transferred between processors, peripherals, storage devices, and display devices. Errors may be introduced during the reading, writing, or actual transmission of this data. Consequently, error detection and control have become an integral part of the design of computer systems. Error detection methods typically involve the addition of one or more redundancy bits to the information-carrying bits of the data so that errors can be detected. These redundancy bits usually do not carry any information per se; they are merely used to determine the correctness of the bits carrying the information.
One form of data redundancy is the use of parity bits and simple parity checking. Parity is well known. It involves summing, without carry, the one bits in a data word and appending a parity bit chosen so that the total count of one bits across the data word, including the parity bit, is either odd or even. Whenever a data word is transferred, the receiver generates a parity value for the received data word and compares it to the appended parity bit sent by the sender. If they do not match, a parity error has occurred and the data transferred is considered suspect. Simple parity schemes detect single bit errors and, more generally, any odd number of bit errors, but they do not detect errors that alter an even number of bits in the data word.
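The parity check described above can be illustrated with a minimal sketch. The function names `parity_bit` and `check` and the 7-bit sample word are assumptions chosen for illustration only; they do not come from any particular system.

```python
def parity_bit(word: int, odd: bool = False) -> int:
    """Return the parity bit for `word`.

    With even parity (the default) the bit makes the total number of
    one bits, parity bit included, even; with odd=True it makes it odd.
    """
    ones = bin(word).count("1")
    even_bit = ones & 1          # sum of the one bits without carry
    return even_bit ^ 1 if odd else even_bit


def check(word: int, received_bit: int, odd: bool = False) -> bool:
    """Receiver side: regenerate the parity bit and compare it."""
    return parity_bit(word, odd) == received_bit


# Hypothetical 7-bit data word containing four one bits.
word = 0b1011001
p = parity_bit(word)             # even parity bit sent with the word

single_error = word ^ 0b0000100  # one bit inverted in transit
double_error = word ^ 0b0000110  # two bits inverted in transit

print(check(word, p))            # intact word passes
print(check(single_error, p))    # single bit error is detected
print(check(double_error, p))    # double bit error slips through
```

The last call shows the limitation noted above: inverting two bits leaves the count of one bits with the same parity, so the check still passes.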
A form of two-dimensional parity checking used on some systems can detect and even correct some types of errors. The data words are arranged in a block of columns with an odd parity bit, called a vertical redundancy check (VRC), added to make the sum of the one bits in each column an odd number. Similarly, an odd parity-check bit, called the longitudinal redundancy check (LRC), is added at the end of the block for each row of bits. As each data block is read, the VRC and LRC are regenerated and compared to the embedded check values to detect any errors. A single bit error, for example, causes exactly one VRC mismatch and one LRC mismatch, which together locate the erroneous bit so that it can be inverted.
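The two-dimensional scheme can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular system: each data word is treated as one column of the block, the word width `WIDTH` is fixed at 8 bits, and the names `encode` and `locate_error` are hypothetical.

```python
WIDTH = 8  # assumed word width in bits


def odd_parity(bits) -> int:
    """Return the bit that makes the total number of one bits odd."""
    return (sum(bits) + 1) & 1


def encode(words):
    """Compute the check bits for a block of data words.

    vrc[k] is the odd parity bit over the bits of word k (one column).
    lrc[i] is the odd parity bit over bit position i of every word (one row).
    """
    vrc = [odd_parity((w >> i) & 1 for i in range(WIDTH)) for w in words]
    lrc = [odd_parity((w >> i) & 1 for w in words) for i in range(WIDTH)]
    return vrc, lrc


def locate_error(words, vrc, lrc):
    """Regenerate the check bits and compare them with the stored ones.

    Returns None when no mismatch is found, or (word_index, bit_index)
    when exactly one column and one row disagree. Any other pattern is
    reported as detected but not correctable.
    """
    new_vrc, new_lrc = encode(words)
    bad_cols = [k for k in range(len(words)) if new_vrc[k] != vrc[k]]
    bad_rows = [i for i in range(WIDTH) if new_lrc[i] != lrc[i]]
    if not bad_cols and not bad_rows:
        return None
    if len(bad_cols) == 1 and len(bad_rows) == 1:
        return bad_cols[0], bad_rows[0]
    raise ValueError("error detected but not correctable")


block = [0x3A, 0xC5, 0x0F, 0x71]      # hypothetical data block
vrc, lrc = encode(block)

damaged = list(block)
damaged[2] ^= 1 << 5                  # invert bit 5 of word 2

location = locate_error(damaged, vrc, lrc)
word_index, bit_index = location
damaged[word_index] ^= 1 << bit_index  # correct the located bit
```

After the final line, `damaged` again equals `block`, because the failing column and row intersect at the single inverted bit.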
A more powerful error detecting method is cyclic redundancy checking (CRC). Here, all data words in a data block are treated as a serial string of bits representing a binary number. This number is divided modulo 2 by a predetermined binary number and the remainder of this division is appended to the data block as a cyclic redundancy check code. The embedded CRC code is compared with the code obtained in a similar fashion by the receiver of the data. If they agree, the data transfer is presumed to be correct. The CRC code is often called a cyclic check sum, or simply a checksum. Various methods of generating CRC codes are described in "Technical Aspects of Data Communication" by John E. McNamara, pp. 110-122, and "An Introduction to Error-Correcting Codes" by Shu Lin.
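The modulo 2 division can be sketched directly on Python integers. The divisor `POLY` (x^8 + x^2 + x + 1), the 8-bit code width, and all function names are assumptions chosen for illustration. The sketch also follows the common convention of shifting the message left by the code width before dividing, which the text above does not specify.

```python
POLY = 0x107   # assumed divisor: x^8 + x^2 + x + 1
CRC_BITS = 8   # width of the remainder for a degree-8 divisor


def mod2_remainder(value: int, poly: int = POLY) -> int:
    """Remainder of value / poly in modulo 2 (carry-less) arithmetic.

    Subtraction modulo 2 is XOR, so long division reduces to repeatedly
    XOR-ing the divisor under the highest remaining one bit.
    """
    plen = poly.bit_length()
    while value.bit_length() >= plen:
        value ^= poly << (value.bit_length() - plen)
    return value


def crc(data: bytes) -> int:
    """Treat the data block as one binary number and return its CRC code."""
    message = int.from_bytes(data, "big")
    return mod2_remainder(message << CRC_BITS)


def append_crc(data: bytes) -> bytes:
    """Sender side: append the CRC code to the data block."""
    return data + bytes([crc(data)])


def verify(frame: bytes) -> bool:
    """Receiver side: recompute the code and compare with the embedded one."""
    data, embedded = frame[:-1], frame[-1]
    return crc(data) == embedded


frame = append_crc(b"file data")
corrupted = bytes([frame[0] ^ 0x10]) + frame[1:]   # invert one bit

print(verify(frame))       # intact block: codes agree
print(verify(corrupted))   # corrupted block: codes differ
```

Production implementations usually process the data a byte at a time with a lookup table; the bit-level division here is meant only to mirror the description.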
The redundancy schemes described above are useful for detecting transmission errors and thus guaranteeing some degree of integrity of the data retrieved from another component in a computer system. But in the most recently developed computer systems, merely guaranteeing correct transmission may not be enough to ensure fault tolerant operation. Consider the situation where a computer system's data is being stored on an outboard file cache system as described in co-pending application, Ser. No. 08/174,750, assigned to Unisys Corporation. In this system architecture, multiple requesters are concurrently reading and writing file data into the file cache. Checksums are used to detect transmission errors when storing or retrieving file data. Checksums, however, do not provide enough support for detecting errors occurring in the system microcode and hardware during file accesses. If a pointer to cached file data managed by the microcode is corrupted for some reason, an incorrect file access request may be made to the file cache. As a result, the wrong data may be retrieved and passed to the requester. This error will go undetected for some indeterminate amount of processing, until the requester, or a process receiving the file data from the requester, discovers that a catastrophic failure has occurred.
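The failure mode described above can be illustrated with a small sketch. Everything in it is hypothetical: the dictionary-based cache, the block numbers, and the `store` and `fetch` names are stand-ins and do not represent the file cache system of the referenced application. Python's standard `zlib.crc32` serves as the checksum.

```python
import zlib

cache = {}  # hypothetical cache: block number -> (data, checksum)


def store(block_number: int, data: bytes) -> None:
    """Store a block together with a checksum of its contents."""
    cache[block_number] = (data, zlib.crc32(data))


def fetch(block_number: int) -> bytes:
    """Retrieve a block and validate its checksum."""
    data, checksum = cache[block_number]
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch")
    return data


store(7, b"payroll records")
store(8, b"inventory records")

wanted = 7
corrupted_pointer = 8              # pointer damaged before the request is made

data = fetch(corrupted_pointer)    # checksum validation succeeds
print(data == cache[wanted][0])    # yet the requester holds the wrong data
```

The checksum covers only the contents of the block that was actually read, so it cannot reveal that the block is not the one the requester intended.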
A new form of data redundancy providing improved error detection is needed to ensure that processes requesting access to file data stored in a file cache system obtain the correct file data. If incorrect file data is retrieved, this error must be detected at the earliest possible time, before propagation of the error throughout the system occurs.