1. Field of the Invention
This invention is related to the field of processors and computer systems and, more particularly, to error correction code (ECC) mechanisms in processors and computer systems.
2. Description of the Related Art
Modern processors are increasingly being designed to offer improved reliability features. For example, reliability features are often demanded in large computing systems such as servers. Generally, reliability features reduce the likelihood that erroneous operation of the processor, or software executing thereon, causes erroneous operation of the system as a whole. At the same time, semiconductor fabrication technology improvements continue to shrink the size of the circuits used to form processors. The smaller circuitry, and the denser packing of circuits made possible by the reduced size, increase the possibility that so-called “soft” errors may be experienced by the processor. Generally, a soft error is an error caused by the occurrence of an event, rather than a defect in the circuitry itself (which produces a “hard” error). Soft errors are intermittent, whereas hard errors occur repeatedly and predictably. Soft errors may occur due to an excessive amount of noise near a circuit, random alpha particles striking the circuit, etc.
Of particular concern for soft errors are various memory arrays, which generally are the most densely packed circuitry within the processor. A soft error may cause one or more bits stored within the memory to change state (e.g. from a binary one to a binary zero, or vice versa). If the changed bits are subsequently accessed, the erroneously changed values may propagate, eventually causing erroneous operation on a larger scale (e.g. reduced reliability of the processor as a whole or reduced reliability of the system including the processor as a whole).
Two popular schemes for protecting against soft errors are parity and error correction code (ECC) schemes. With parity, a single parity bit is stored for a given set of data bits, representing whether the number of binary ones in the data bits is even or odd. The parity is generated when the set of data bits is stored and is checked when the set of data bits is accessed. If the parity regenerated from the accessed set of data bits does not match the stored parity bit, then an error is detected. However, while utilizing parity bits may enable detection of errors, parity bits do not provide the ability to correct these errors.
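The parity scheme described above may be illustrated with a minimal sketch. The function names `parity_bit`, `store`, and `check` are hypothetical and are chosen only for this example; the sketch assumes the parity bit records whether the number of binary ones is odd.

```python
def parity_bit(data: int) -> int:
    """Return 1 if data holds an odd number of binary ones, else 0."""
    return bin(data).count("1") & 1


def store(data: int):
    """Generate the parity bit at store time and keep it with the data."""
    return data, parity_bit(data)


def check(data: int, stored_parity: int) -> bool:
    """Regenerate parity at access time; True means no error was detected."""
    return parity_bit(data) == stored_parity
```

In such a sketch, a single flipped bit changes the regenerated parity and is detected, whereas two flipped bits cancel each other and pass the check. In neither case does the parity bit identify which bit is in error.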
In contrast to parity, ECC schemes assign several ECC bits (“ECC data”) per set of data bits in the memory. The ECC bits are encoded from various overlapping combinations of the corresponding data bits. The encodings are selected such that not only can a bit error or errors be detected, but the bit or bits in error may also be identified so that the error can be corrected (depending on the number of bits in error and the ECC scheme being used). In ECC schemes, there are two types of errors that may be detected (referred to as “ECC errors” herein). A “correctable ECC error” is an ECC error in which the bit or bits in error are identifiable so that a correction may be performed (e.g. by inverting the identified bits). An “uncorrectable ECC error” is an ECC error in which the error is detected but the bits in error are not identifiable and thus the error cannot be corrected. Depending on the number of ECC bits and the corresponding number of data bits, the maximum number of bits which may be in error for a correctable ECC error and the maximum number of bits which may be in error for an uncorrectable ECC error may vary. Generally, the number of bit errors constituting a correctable error is less than the number of bit errors constituting an uncorrectable error for a given ECC scheme. For example, one ECC scheme is the single error correct, multiple error detect (SEC-MED) scheme. In the SEC-MED scheme, a single bit error (per set of data bits and corresponding ECC bits) is a correctable error and multiple bit errors are detectable but uncorrectable errors. Other schemes may be capable of correcting double bit errors and detecting larger numbers of bit errors, etc.
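The distinction between correctable and uncorrectable ECC errors may be illustrated with a small sketch. The sketch assumes an extended Hamming code over four data bits, which corrects any single bit error and detects any double bit error; this code is chosen here only for brevity and is not asserted to be the scheme of any particular processor. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit, indexes 1, 2 and 4 hold check bits,
    and indexes 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit covers an overlapping combination of the data bits.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1  # overall parity across indexes 1..7
    return code


def decode(code: list):
    """Return (status, nibble); nibble is None for an uncorrectable error."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # a nonzero syndrome is the index of a single flipped bit
    overall = sum(code) & 1
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: treated as one bit in error, identified by the syndrome
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits are in error,
        # which is detected but cannot be located.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

Under these assumptions, inverting any one bit of a codeword yields a correctable ECC error, and inverting any two bits yields an uncorrectable ECC error. The check bits and the overall parity bit are themselves protected in the same manner as the data bits.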
Like parity, the ECC data is generated when the corresponding data bits are stored in a memory. The ECC data is also stored in the memory or another memory provided for storing ECC data, and thus the ECC data itself is subject to possible error. When the data bits are later accessed, the ECC bits are regenerated and compared to the ECC bits which were stored with the data. The encoding scheme for the ECC data allows for errors in the data being protected and in the ECC data to be detected (and corrected, if applicable).
Typically, the number of bits that can be read out of a cache in one read access is referred to as the “cache width”. For example, in a cache whose width is 64 bits, each cache read will read 64 bits. However, microprocessors may be configured to access quantities of data which are of different sizes than the cache width. For example, a microprocessor with a 32-bit word may be configured to access data sizes such as a byte (8 bits), half word (16 bits), word (32 bits), and double word (64 bits). Therefore, even though 64 bits are read from the cache, a byte load will use only one of the bytes, a half word load will use only two of the bytes, and a word load will use only four of the bytes.
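The relationship between the cache width and the load size may be sketched as follows. The sketch assumes a 64-bit cache width, little-endian byte numbering, and naturally aligned accesses; the names `SIZES` and `load` are hypothetical.

```python
# Access sizes in bytes for a processor with a 32-bit word.
SIZES = {"byte": 1, "half": 2, "word": 4, "double": 8}


def load(read64: int, offset: int, size: str) -> int:
    """Select the bytes a load actually uses from one 64-bit cache read.

    offset is the byte offset within the 64 bits (byte 0 is least significant).
    """
    nbytes = SIZES[size]
    if offset % nbytes or offset + nbytes > 8:
        raise ValueError("access must be naturally aligned within the 64 bits")
    mask = (1 << (8 * nbytes)) - 1
    return (read64 >> (8 * offset)) & mask
```

In every case all 64 bits are read from the cache; only the number of bytes forwarded to the load differs.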
In order to generate ECC check bits for a given unit of data, all of the corresponding data bits must be available as input to the ECC generation logic. Consequently, when a write operation to only part of the data is performed, the rest of the original data which is stored at that location must first be fetched and merged with the changes from the current store, after which the new ECC check bits are calculated and stored. Generally, data accesses and corresponding ECC operations may be pipelined. For example, a possible read pipeline may include reading data from a cache in a first clock cycle, broadcasting the data and generating ECC data in a second clock cycle, reading corresponding ECC data in a third clock cycle, and computing ECC errors in a fourth clock cycle. If no ECC errors are detected, the ECC operation is complete. However, if an ECC error is detected, the previously broadcast data must be canceled, the data corrected, and the correct data rebroadcast.
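The fetch, merge, and regenerate sequence required for a partial write may be sketched as follows. The function `check_bits` is a hypothetical stand-in for the ECC generation logic (it is not a complete ECC encoding), and the class `EccWord` and its method `store_partial` are likewise assumptions made only for illustration. A 64-bit unit of data with little-endian byte numbering is assumed.

```python
MASK64 = (1 << 64) - 1


def check_bits(data: int) -> int:
    """Stand-in ECC generator: XOR of the 1-based indexes of all set bits.

    It needs every data bit as input, which is the property of interest here.
    """
    acc, index = 0, 1
    while data:
        if data & 1:
            acc ^= index
        data >>= 1
        index += 1
    return acc


class EccWord:
    """One 64-bit unit of data stored together with its check bits."""

    def __init__(self, data: int = 0):
        self.data = data & MASK64
        self.ecc = check_bits(self.data)

    def store_partial(self, offset: int, nbytes: int, value: int) -> None:
        """Write nbytes at byte offset, keeping the check bits consistent."""
        old = self.data                                   # 1. fetch the original data
        field = ((1 << (8 * nbytes)) - 1) << (8 * offset)
        merged = (old & ~field & MASK64) | ((value << (8 * offset)) & field)  # 2. merge
        self.data = merged
        self.ecc = check_bits(merged)                     # 3. regenerate and store check bits
```

For example, `EccWord(0x1122334455667788).store_partial(1, 1, 0xAB)` replaces only byte 1, yet the check bits are recomputed over all 64 bits of the merged data.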
As the example above illustrates, ECC errors may delay the completion of a load by several cycles. Additionally, canceling the first data broadcast may require a scheduler to cancel all operations which are dependent on the load data and to reissue them. As an alternative to cancel and rebroadcast, all load data broadcasts may be delayed until the ECC error checks have been completed. However, such an approach extends the load pipeline for every load and is not a desirable solution, especially because ECC errors may not be very common. Because of the delays caused by these ECC errors, reducing the probability that such an error will be read in the first place is desirable.
One scheme that is used to reduce the probability of reading erroneous data is to use a “scrubber” that periodically inspects a line of the cache or memory for ECC errors. If the scrubber finds an error, the error is corrected. Such a technique is designed to lower the probability that a real load would hit the ECC error. Typically, the scrubber may start at one end of the cache (for example, address 0) and make its way across the entire cache (address 1, 2, etc., until it reaches the maximum address, and then back to 0 again). However, utilizing this approach, the scrubber may end up working on one part of the cache while load accesses are going to a different part of the cache and still hitting ECC errors.
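The sequential scrubber described above may be sketched as follows. The class `Scrubber` and the function `check_bits` are hypothetical. The stand-in check function is assumed to locate only a single flipped data bit per line, and the stored check bits are assumed to be error free; a practical ECC scheme would not rely on either simplification.

```python
def check_bits(data: int) -> int:
    """Stand-in ECC generator: XOR of the 1-based indexes of all set bits."""
    acc, index = 0, 1
    while data:
        if data & 1:
            acc ^= index
        data >>= 1
        index += 1
    return acc


class Scrubber:
    """Walks the cache one line per step: address 0, 1, 2, ..., max, then 0 again."""

    def __init__(self, data: list, ecc: list):
        self.data = data      # data[addr] is the stored line
        self.ecc = ecc        # ecc[addr] is the check value stored with the line
        self.next_addr = 0

    def step(self):
        """Inspect one line, correct it if needed, and return (addr, corrected)."""
        addr = self.next_addr
        syndrome = check_bits(self.data[addr]) ^ self.ecc[addr]
        corrected = syndrome != 0
        if corrected:
            # Under the single-bit assumption the syndrome is the 1-based
            # index of the flipped bit, so inverting that bit repairs the line.
            self.data[addr] ^= 1 << (syndrome - 1)
        self.next_addr = (addr + 1) % len(self.data)  # wrap at the maximum address
        return addr, corrected
```

Because `next_addr` advances independently of where loads are directed, a line that is frequently loaded may be read many times before the scrubber next reaches it.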