1. Field of the Invention
The present invention generally relates to general purpose digital data processing systems and more particularly relates to such systems which utilize error detection and correction schemes therein.
2. Description of the Prior Art
A key design element of high reliability computer systems is that of error detection and correction. It has long been recognized that the integrity of the data bits within the computer system is critical to ensure the accuracy of operations performed in the data processing system. The alteration of a single data bit in a data word can dramatically affect arithmetic calculations or can change the meaning of a data word as interpreted by other sub-systems within the computer system.
The cause of an altered data bit may be traced to either a "soft error" or a "hard error" within a memory element. Soft errors are not permanent in nature and may be caused by alpha particles, electromagnetic radiation, random noise, or other non-destructive events. Soft errors are often referred to as bit-flips, indicating that a bit has inadvertently been flipped from a one to a zero or vice versa. Hard errors, on the other hand, are permanent in nature and are often referred to as stuck-at faults. Typically, a hard error may be caused by a manufacturing defect in a memory element or by some other destructive event such as a voltage spike.
One method for performing error detection is to associate an additional bit, called a "parity bit", along with the binary bits comprising a data word. The data word may comprise data, an instruction, an address, etc. Parity involves summing without carry the bits representing a "one" within a data word and providing an additional "parity bit" so that the total number of "ones" across the data word, including the added parity bit, is either odd or even. The term "Even Parity" refers to a parity mechanism which provides an even number of ones across the data word including the parity bit. Similarly, the term "Odd Parity" refers to a parity mechanism which provides an odd number of ones across the data word including the parity bit.
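The parity definition above can be illustrated with a minimal Python sketch. The function name `parity_bit` and the 7-bit sample word are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
def parity_bit(word: int, even: bool = True) -> int:
    """Return the parity bit for a non-negative integer data word.

    Hypothetical helper for illustration only.
    """
    ones = bin(word).count("1")  # number of bits set to one in the word
    xor_sum = ones & 1           # sum without carry (XOR) of all data bits
    # Even parity: the added bit makes the total count of ones even.
    # Odd parity: the added bit makes the total count of ones odd.
    return xor_sum if even else xor_sum ^ 1


word = 0b1011001  # sample data word containing four ones
even_bit = parity_bit(word, even=True)   # 0: four ones is already even
odd_bit = parity_bit(word, even=False)   # 1: a fifth one makes the count odd
```

Under even parity the word already holds an even number of ones, so the parity bit is zero; under odd parity the parity bit must be one.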
A typical system which uses parity as an error detection mechanism has a parity generation circuit for generating the parity bit. For example, when the system stores a data word into memory, the parity generation circuit generates a parity bit from the data word and the system stores both the data word and the corresponding parity bit into an address location in a memory. When the system reads the address location where the data word is stored, both the data word and the corresponding parity bit are read from the memory. The parity generation circuit then regenerates the parity bit from the data bits read from the memory device and compares the regenerated parity bit with the parity bit that is stored in memory. If the regenerated parity bit and the original parity bit do not compare, an error is detected and the system is notified.
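The store and read sequence described above may be sketched in software as follows. This is only a model under stated assumptions: the class `ParityMemory`, the exception `ParityError`, and the `flip_bit` fault-injection helper are hypothetical names, and even parity is assumed. An actual system would perform these steps in hardware.

```python
def even_parity(word: int) -> int:
    """Parity bit that makes the total number of ones even."""
    return bin(word).count("1") & 1


class ParityError(Exception):
    """Raised when the regenerated and stored parity bits differ."""


class ParityMemory:
    """Toy memory that stores a parity bit alongside each data word."""

    def __init__(self):
        self.cells = {}  # address -> (data word, stored parity bit)

    def write(self, address: int, word: int) -> None:
        # Generate the parity bit and store it with the data word.
        self.cells[address] = (word, even_parity(word))

    def read(self, address: int) -> int:
        word, stored = self.cells[address]
        # Regenerate parity from the data bits read and compare.
        if even_parity(word) != stored:
            raise ParityError(f"parity mismatch at address {address:#x}")
        return word

    def flip_bit(self, address: int, bit: int) -> None:
        # Fault injection: corrupt one data bit, leave stored parity alone.
        word, stored = self.cells[address]
        self.cells[address] = (word ^ (1 << bit), stored)


mem = ParityMemory()
mem.write(0x10, 0b1011)
clean = mem.read(0x10)   # parity matches, so the word is returned
mem.flip_bit(0x10, 2)    # simulate a single-bit error in the stored word
try:
    mem.read(0x10)
    detected = False
except ParityError:
    detected = True      # the mismatch is reported to the caller
```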
It is readily known that a single parity bit in conjunction with a multiple bit data word can detect a single bit error within the data word. However, it is also readily known that a single parity bit in conjunction with a multiple bit data word can be defeated by multiple errors within the data word. As calculation rates increase, circuit sizes decrease, and voltage levels of internal signals decrease, the likelihood of multiple errors within a data word increases. Therefore, methods to detect multiple errors within a data word are essential.
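The weakness of a single parity bit can be seen in a short sketch. The 8-bit word and the error masks below are arbitrary values chosen for illustration, and even parity is assumed.

```python
def even_parity(word: int) -> int:
    """Parity bit that makes the total number of ones even."""
    return bin(word).count("1") & 1


original = 0b10110010
stored_parity = even_parity(original)

single_error = original ^ 0b00000100  # one bit flipped
double_error = original ^ 0b00010100  # two bits flipped

# A single flipped bit changes the count of ones by one, so parity changes.
single_detected = even_parity(single_error) != stored_parity  # True

# Two flipped bits change the count of ones by -2, 0, or +2, so the
# regenerated parity still matches and the error goes unnoticed.
double_detected = even_parity(double_error) != stored_parity  # False
```

More generally, any even number of bit errors leaves the parity unchanged, while any odd number of bit errors is detected.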
System designers have developed methods for detecting multiple errors within multiple bit data words by providing multiple parity bits for each multiple bit data word. Although this technique has been successfully used, it significantly increases the overhead required to perform error detection because the parity generation circuit is more complex and the additional parity bits must be stored along with each data word. It can readily be seen that each additional parity bit that is included within a system adds a significant amount of overhead to the system.
Parity generation techniques are also used to perform error correction within a data word. Error correction is typically performed by encoding the data word to provide error correction code bits that are stored along with the bits of the data word. Upon readout, the data bits read from the addressable memory location are again subject to the generation of the same error correction code signal pattern. The newly generated pattern is compared to the error correction code signals stored in memory. If a difference is detected, it is determined that the data word is erroneous. Depending on the encoding system utilized it is possible to identify and correct the bit position in the data word indicated as being incorrect. The system overhead for the utilization of error correction code signals is substantial. The overhead includes the time necessary to generate the error correction codes, the memory cells necessary to store the error correction code bits for each corresponding data word, and the time required to perform the decode when the data word is read from memory. These represent disadvantages to the error correction code system.
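As one concrete illustration of the encode, regenerate, and compare sequence described above, the following sketch uses a Hamming(7,4) code. The choice of Hamming(7,4) is an assumption made for the example; the text does not name a specific encoding, and the function names are hypothetical. Three check bits protect four data bits, and the comparison result (the syndrome) gives the position of a single erroneous bit.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit code word.

    Code word positions 1..7 hold p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Return (corrected data bits, syndrome) for a 7-bit code word.

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the bit assumed to be in error.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
corrupted = list(stored)
corrupted[4] ^= 1  # single-bit error at position 5
recovered, syndrome = hamming74_decode(corrupted)
```

The sketch also makes the overhead visible: three extra bits are stored for every four data bits, and both the encode and decode steps add work on each memory access. This code corrects only single-bit errors; a double-bit error would be miscorrected.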
Error detection schemes may be used on various internal nodes of a computer system. That is, for high reliability computer systems, many of the data paths within the computer system may have an error detection scheme incorporated therein. However, because of the relatively high overhead cost associated with multiple bit error detection, usually only a limited number of parity bits or the like may be provided. Further, because of the relatively high overhead cost associated with error correction schemes, only the most critical data paths may utilize such schemes. Finally, because error correction schemes may degrade the performance of a corresponding data path of the computer system, the use of such schemes is often precluded on time critical data paths.
In addition to the above referenced limitations, typical error detection schemes cannot determine the source or nature of an error. Rather, error detection schemes typically only identify that an error exists on a corresponding bus. Under some circumstances, it may be important to identify the underlying hardware element that is the source of the error and also identify the nature of the fault. For example, if an error is detected in a microcode instruction of a computer system, it may be important to determine the hardware source of the error and whether the error is fatal. An error may be considered fatal if the error cannot be corrected without aborting the operation of the computer system. For the example described above, the source of the error may be a memory device and the nature of the error may be a soft error or a hard error. A soft error may be corrected during the operation of the computer system by simply over-writing the correct data to the corrupted memory location, and therefore a soft error may not be deemed to be fatal. However, a hard error within the memory element cannot be corrected without aborting the operation of the computer system and replacing the memory element, and therefore a hard error may be deemed to be a fatal error.
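The distinction between a correctable soft error and a fatal hard error can be modeled with a simple sketch. This is purely illustrative and is not a mechanism taken from the text: the `MemoryCell` class, its stuck-at model, and the `classify_error` rewrite-and-read-back test are all hypothetical assumptions.

```python
class MemoryCell:
    """Toy memory location with optional stuck-at bits (hard faults)."""

    def __init__(self, stuck_mask: int = 0, stuck_value: int = 0):
        self.stuck_mask = stuck_mask    # bits that cannot be changed
        self.stuck_value = stuck_value  # the values those bits are stuck at
        self.value = 0

    def write(self, word: int) -> None:
        self.value = word

    def read(self) -> int:
        # Stuck bits always read back their stuck value.
        return (self.value & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)

    def flip(self, bit: int) -> None:
        # Simulate a soft error (bit-flip) in the stored value.
        self.value ^= 1 << bit


def classify_error(cell: MemoryCell, correct_word: int) -> str:
    """Over-write the correct word and read it back.

    Assumed heuristic: if the rewrite sticks, treat the error as soft
    (non-fatal); if the location still reads wrong, treat it as hard (fatal).
    """
    cell.write(correct_word)
    return "soft" if cell.read() == correct_word else "hard"


correct = 0b1101

soft_cell = MemoryCell()
soft_cell.write(correct)
soft_cell.flip(1)                                 # transient bit-flip
soft_result = classify_error(soft_cell, correct)  # "soft"

hard_cell = MemoryCell(stuck_mask=0b0001, stuck_value=0b0000)  # bit 0 stuck at 0
hard_cell.write(correct)
hard_result = classify_error(hard_cell, correct)  # "hard"
```

One limitation of this heuristic is that a stuck-at bit whose stuck value happens to equal the correct bit reads back correctly and would therefore go unnoticed for that particular data word.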
As can be seen, a number of otherwise non-fatal errors may be deemed to be fatal because the source and nature of the error cannot be identified during the operation of the computer system. That is, because none of the prior art error detection schemes provide a mechanism for identifying the source and nature of an error during the operation of the computer system, the system may abort when it would otherwise not be necessary. The use of prior art error detection schemes may, therefore, require a computer system to treat a non-fatal error as a fatal error in order to preserve the integrity of the data base. For example, any error detected in a microcode word may be considered fatal, even if the error is a soft error within a memory that could be corrected by simply writing a correct microcode word to the corrupted memory location. Any further error analysis may be performed by a support controller, but only after the operation of the computer system is aborted. As can readily be seen, this may limit the overall reliability and performance of the corresponding computer system.