1. Field of the Invention
This invention is related to the field of processors and computer systems and, more particularly, to error correction code (ECC) mechanisms in processors and computer systems.
2. Description of the Related Art
Modern processors are more frequently being designed to offer reliability features. For example, reliability features are often demanded in large computing systems such as servers. Generally, reliability features reduce the likelihood that erroneous operation of the processor, or software executing thereon, causes erroneous operation of the system as a whole. At the same time, semiconductor fabrication technology improvements continue to shrink the size of the circuits used to form processors. The smaller circuitry, and the more dense packing of circuits made possible by the reduced size, increases the possibility that so-called “soft” errors may be experienced by the processor. Generally, a soft error is an error caused by the occurrence of an event, rather than a defect in the circuitry itself (which produces a “hard” error). Soft errors are intermittent, whereas hard errors occur repeatedly and predictably. Soft errors may occur due to an excessive amount of noise near a circuit, random alpha particles striking the circuit, etc.
Of particular concern for soft errors are various memory arrays, which generally are the most densely packed circuitry within the processor. A soft error may cause one or more bits stored within the memory to change state (e.g. from a binary one to a binary zero, or vice versa). If the changed bits are subsequently accessed, the erroneously changed values may propagate, eventually causing erroneous operation on a larger scale (e.g. reduced reliability of the processor as a whole or reduced reliability of the system including the processor as a whole).
Two popular schemes for protecting against soft errors are parity and error correction code (ECC) schemes. With parity, a single parity bit is stored for a given set of data bits, representing whether the number of binary ones in the data bits is even or odd. The parity is generated when the set of data bits is stored and is checked when the set of data bits is accessed. If the parity doesn't match the accessed set of data bits, then an error is detected. However, with parity, errors cannot be corrected.
ECC schemes assign several ECC bits (“ECC data”) per set of data bits in the memory. The ECC bits are encoded from various overlapping combinations of the corresponding data bits. The encodings are selected such that not only can a bit error or errors be detected, but the bit or bits in error may be identifiable so that the error can be corrected (depending on the number of bits in error and the ECC scheme being used). In ECC schemes, there are two types of errors that may be detected (referred to as “ECC errors”herein). A “correctable ECC error” is an ECC error in which the bit or bits in error are identifiable so that a correction may be performed (e.g. by inverting the identified bits). An “uncorrectable ECC error” is an ECC error in which the error is detected but the bits in error are not identifiable and thus the error cannot be corrected. Depending on the number of ECC bits and the corresponding number of data bits, the maximum number of bits which may be in error for a correctable ECC error and the maximum number of bits which may be in error for an uncorrectable ECC error may vary. Generally, the number of bit errors constituting a correctable error is less than the number of bit errors constituting an uncorrectable error for a given ECC scheme. For example, one ECC scheme is the single error correct, multiple error detect (SEC-MED) scheme. In the SEC-MED scheme, a single bit error (per set of data bits and corresponding ECC bits) is a correctable error and multiple bit errors are detectable but uncorrectable errors. Other schemes may be capable of correcting double bit errors and detecting larger numbers of bit errors, etc.
Like parity, the ECC data is generated when the corresponding data bits are stored in a memory. The ECC data is also stored in the memory or another memory provided for storing ECC data, and thus the ECC data itself is subject to possible error. The encoding scheme for the ECC data allows for errors in the data being protected and in the ECC data to be detected (and corrected, if applicable).
Typically, both ECC error detection and correction are implemented in hardware. The hardware typically used to perform ECC correction is provided access to the memory storing the ECC-protected data, so that the corrected data may be written to the memory. Generally, the ECC correction hardware may share a port to the memory with sources of memory accesses. When the ECC correction hardware requires access to correct an error, the other sources are temporarily not able to access the memory. Arbitration circuitry for the port may require some amount of time to select among the access sources and the ECC correction hardware.