1. Field of the Invention
This invention is related to error detection and correction in computing systems.
2. Description of the Related Art
Error codes are commonly used in electronic and computing systems to detect and correct data errors, such as transmission errors or storage errors. For example, error codes may be used to detect and correct errors in data transmitted via any transmission medium (e.g. conductors and/or transmitting devices between chips in an electronic system, a network connection, a telephone line, a radio transmitter, etc.). Error codes may additionally be used to detect and correct errors associated with data stored in the memory of computer systems. One common use of error codes is to detect and correct errors in data transmitted on a data bus of a computer system. In such systems, error correction bits, or check bits, may be generated for the data prior to its transfer or storage. When the data is received or retrieved, the check bits may be used to detect and correct errors within the data.
Another source of errors in electrical systems may be so-called “soft” or “transient” errors. Transient memory errors may be caused by the occurrence of an event, rather than a defect in the memory circuitry itself. Transient memory errors may occur due to, for example, random alpha particles striking the memory circuit. Transient communication errors may occur due to noise on the data paths, inaccurate sampling of the data due to clock drift, etc. On the other hand, “hard” or “persistent” errors may occur due to component failure.
Generally, various error detection code (EDC) and error correction code (ECC) schemes are used to detect and correct memory and/or communication errors. For example, parity protection may be used. With parity, a single parity bit is stored/transmitted for a given set of data bits, representing whether the number of binary ones in the data bits is even or odd. The parity is generated when the set of data bits is stored/transmitted and is checked when the set of data bits is accessed/received. If the parity does not match the accessed set of data bits, then an error is detected. Such an approach detects any single bit error (more generally, any odd number of bit errors), but it cannot identify which bit is in error, and an even number of bit errors goes undetected.
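The parity scheme described above can be illustrated with a minimal sketch. The function names `parity_bit` and `check` and the example data word are hypothetical, chosen only for illustration; the sketch assumes an 8-bit data word and a parity bit that records whether the count of ones is odd.

```python
def parity_bit(data: int) -> int:
    """Return 1 if the number of one-bits in data is odd, else 0."""
    return bin(data).count("1") & 1


def check(data: int, stored_parity: int) -> bool:
    """Return True if the recomputed parity matches the stored parity."""
    return parity_bit(data) == stored_parity


# Generate parity when the data is stored/transmitted.
word = 0b10110010
p = parity_bit(word)

# One flipped bit changes the parity, so the error is detected.
single_flip = word ^ 0b00000100
single_detected = not check(single_flip, p)

# Two flipped bits cancel out in the parity, so the error is missed.
double_flip = word ^ 0b00010100
double_detected = not check(double_flip, p)
```

In this sketch the single-bit error is flagged while the double-bit error passes the check, which is why parity alone is generally regarded as a single bit error detection scheme.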
Other EDC/ECC schemes assign multiple check bits per set of data bits. The encodings are selected such that a bit error or errors may be detected, and in some cases the encodings may be selected such that the bit or bits in error may be identifiable so that the error can be corrected (depending on the number of bits in error and the ECC scheme being used). Typically, as the number of bit errors that can be detected and/or corrected increases, the number of check bits used in the scheme increases as well.
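As one illustration of a scheme with multiple check bits, the following sketch uses the well-known Hamming(7,4) code, which adds three check bits to four data bits so that a single bit in error can be located and corrected. It is offered only as an example of the general idea and is not taken from any particular system; the function names and bit layout (check bits at positions 1, 2, and 4) are assumptions of the sketch.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused
    code[3], code[5], code[6], code[7] = d     # data bit positions
    code[1] = code[3] ^ code[5] ^ code[7]      # check bit covering 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # check bit covering 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # check bit covering 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits: list) -> tuple:
    """Return (corrected data nibble, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a one
    if syndrome:
        code[syndrome] ^= 1                    # syndrome names the bit in error
    d = (code[3], code[5], code[6], code[7])
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


codeword = hamming74_encode(0b1011)
corrupted = list(codeword)
corrupted[4] ^= 1                              # flip the bit at position 5
recovered, syndrome = hamming74_decode(corrupted)
```

With three check bits this code corrects one bit in error; two bits in error would be miscorrected, which reflects the point above that detecting or correcting more errors generally requires more check bits.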
In addition to the above, there are failure modes where the entire EDC or ECC codeword is misplaced or substituted. In such a case, the error protection codes may not provide protection. Examples of such failures include addressing failures in a memory array, logic failures in control state machines, or various kinds of control failures at structures such as registers, multiplexors, queues or stacks. Under these circumstances, perfectly valid data intended for different transactions can become mixed or swapped, resulting in undetectable errors and silent data corruption.
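The failure mode described above can be sketched as follows. The memory model, the `fault_mask` parameter, and the addresses are hypothetical and serve only to simulate an addressing failure; parity is used as the error code for brevity, though the same outcome would be expected with any code computed over the data bits alone.

```python
def parity_bit(data: int) -> int:
    """Return 1 if the number of one-bits in data is odd, else 0."""
    return bin(data).count("1") & 1


memory = {}


def write(addr: int, data: int) -> None:
    """Store the data together with its check bit."""
    memory[addr] = (data, parity_bit(data))


def read(addr: int, fault_mask: int = 0):
    """Read an entry; address bits set in fault_mask are forced to 0
    to simulate an addressing failure in the memory array."""
    actual = addr & ~fault_mask
    data, stored_parity = memory[actual]
    passed = parity_bit(data) == stored_parity
    return data, passed


write(0x10, 0xAA)
write(0x14, 0x55)

# A faulty address bit causes the read of 0x14 to return the entry at 0x10.
data, passed = read(0x14, fault_mask=0x04)
```

Here `passed` is true even though `data` holds the value written to a different address: the codeword is internally consistent, so the check reports no error and the wrong data is consumed silently.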
As efforts have been made to optimize and enhance the memory systems of computing devices, new complexities such as cache hierarchies, coherency protocols, and other features have been introduced. Further, memory transactions other than basic reads and writes are now common. In many cases, a unit of memory data may be viewed as a data value, along with its associated state, bound to an address within a well-defined address space.
In some systems, optimization of memory system transactions has led to the separation of a transaction's data value and/or state from its associated address. For example, in a system where a piece of memory data may reside in one of several caches, a transaction to update the data value may take the form of an address query through an address-only path, followed by the data value (and/or state information) traveling on a different path at a different time in order to complete the transaction. In such an environment, individual error protection on the separate paths for the state, value, and address may not provide fault coverage for the entire transaction.
In view of the above, an effective method and mechanism for detecting errors is desired.