Embodiments of the present invention relate generally to error detection and/or correction in a semiconductor device.
Single-bit upsets or errors from transient faults have emerged as a key challenge in semiconductor design. These faults arise from energetic particles, such as neutrons from cosmic rays and alpha particles from packaging material. These particles generate electron-hole pairs as they pass through a semiconductor device. Transistor source and diffusion nodes can collect these charges. A sufficient amount of accumulated charge may change the state of a logic device such as a static random access memory (SRAM) cell, a latch, or a gate, thereby introducing a logical error into the operation of an electronic circuit. Because this type of error does not reflect a permanent failure of the device, it is termed a soft or transient error.
Soft errors become an increasing burden for designers as the number of on-chip transistors continues to grow. The raw error rate per latch or SRAM bit may be projected to remain roughly constant or decrease slightly for the next several technology generations. Thus, unless error protection mechanisms are added or more robust technology (such as fully-depleted silicon-on-insulator) is used, a device's soft error rate may grow in proportion to the number of devices added in each succeeding generation. Additionally, aggressive voltage scaling may cause such errors to become significantly worse in future generations of chips.
Bit errors may be classified based on their impact and on the ability to detect and correct them. Some bit errors may be classified as “false errors” because they are not read, do not matter, or can be corrected before they are used. The most insidious form of error is silent data corruption (“SDC”), where an error is not detected and induces the system to generate erroneous outputs. To avoid SDC, designers often employ error detection mechanisms, such as parity. Error correction techniques such as error correcting codes (ECC) may also be employed to detect and correct errors, although such techniques cannot be applied in all situations.
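By way of illustration only, the parity-based detection mentioned above may be sketched as follows. The sketch assumes even parity over a single data word; the names `parity_bit` and `has_parity_error` are hypothetical and are not taken from any particular design. It shows that a single-bit upset is detected, while a double-bit upset leaves the parity unchanged and would therefore pass silently.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word (hypothetical helper)."""
    return bin(word).count("1") % 2


def has_parity_error(word: int, stored_parity: int) -> bool:
    """Report whether the word no longer matches the parity stored alongside it."""
    return parity_bit(word) != stored_parity


data = 0b10110010                  # original word, four bits set
stored = parity_bit(data)          # parity computed when the word is written

single_upset = data ^ (1 << 3)     # one bit flipped by a transient fault
double_upset = data ^ 0b11         # two bits flipped

print(has_parity_error(data, stored))          # unmodified word
print(has_parity_error(single_upset, stored))  # odd number of flips
print(has_parity_error(double_upset, stored))  # even number of flips
```

Parity detects any odd number of flipped bits but cannot correct them and cannot detect an even number of flips, which is one reason correcting codes may be used where their cost is acceptable.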
In a multiprocessor system, each individual processor core typically includes an internal cache memory, and often a hierarchy of internal caches. Furthermore, each processor often has a portion of the system's main memory locally attached. Because the main memory is shared by all processor cores and is also accessed and cached locally within each core or node, coherency mechanisms are needed to ensure that every core or node observes a consistent view of each memory location.
A prior art cache coherency protocol includes a plurality of states in which a cache line may reside, namely modified, exclusive, shared, and invalid (MESI) states. Modern multiprocessor systems often employ a directory-based coherence mechanism, in which the state of a memory block in each of the caches is maintained in a table referred to as a directory. Such directories often include parity bits and/or error correction codes (ECC) in order to detect and correct errors occurring in the directory. However, these mechanisms consume die area and also incur power and processing-time penalties.
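By way of illustration only, a directory of the kind described above may be sketched as a table keyed by memory block address. This is a deliberately simplified model under stated assumptions, not a description of any actual implementation: the names `State`, `DirectoryEntry`, and `Directory`, the bit-vector encoding of sharers, and the reduction of the protocol to read and write requests are all hypothetical, and write-backs and message passing are omitted.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


@dataclass
class DirectoryEntry:
    state: State = State.INVALID
    sharers: int = 0  # assumed encoding: bit i set means cache i holds a copy


class Directory:
    """Hypothetical directory tracking the state of each memory block."""

    def __init__(self) -> None:
        self.entries: dict[int, DirectoryEntry] = {}

    def _entry(self, block: int) -> DirectoryEntry:
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, block: int, cache_id: int) -> State:
        """Record a read by cache_id and return the resulting block state."""
        e = self._entry(block)
        others = e.sharers & ~(1 << cache_id)
        if others:
            e.state = State.SHARED       # at least one other cache holds a copy
        elif not (e.sharers >> cache_id) & 1:
            e.state = State.EXCLUSIVE    # first and only holder
        e.sharers |= 1 << cache_id
        return e.state

    def write(self, block: int, cache_id: int) -> list[int]:
        """Record a write by cache_id and return the caches to invalidate."""
        e = self._entry(block)
        invalidated = [i for i in range(e.sharers.bit_length())
                       if (e.sharers >> i) & 1 and i != cache_id]
        e.sharers = 1 << cache_id
        e.state = State.MODIFIED
        return invalidated


directory = Directory()
first_read = directory.read(0x40, cache_id=0)
second_read = directory.read(0x40, cache_id=1)
to_invalidate = directory.write(0x40, cache_id=1)
```

In this model, a single flipped bit in `sharers` would cause an invalidation to be skipped or sent to a cache that holds no copy, which illustrates why directory contents are commonly protected by parity bits or ECC.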
Accordingly, a need exists to improve error detection and correction mechanisms in a directory-based cache coherency protocol.