Error correction is often used for main memory systems using dynamic random-access memory (DRAM) chips. More recently, error correction is also being applied to cache memories that use static random-access memory (SRAM) or DRAM chips. The larger amounts of data processed by today's higher-performance systems require a lower error rate than older systems; otherwise system crashes would occur increasingly often in higher-speed systems.
An error-correction code (ECC) is often stored with a data word in the memory or cache line. For example, 8 bits of ECC may be stored with every 64-bit data word, for a total of 72 bits per word. A wide variety of codes for ECC are known and published in the technical and academic literature.
FIGS. 1A-B show error detection and correction using a SECDED code. A popular class of ECC code is known as single-error-correction, double-error-detection (SECDED). SECDED has the ability to correct a 1-bit error anywhere within the data word, and to detect a longer 2-bit error.
In FIG. 1A, a single-bit error occurs in the data word, at the location indicated by the question mark. Using an ECC field encoded as a SECDED code, an error correction unit can correct the single-bit error. The corrected data may be used in a system such as a processor.
In FIG. 1B, a double-bit error is detected. The two error bits are each shown by a “?”. This error exceeds the maximum number of correctable bits (1), but the error can still be detected by the SECDED code. Although the exact location of the error within the data word is not known, detecting the error is still useful since actions can be taken to recover from the detected error. For example, a computer system may be halted before other data is over-written with faulty results produced by using this faulty data word. Some computer systems may be able to isolate the program or routine that requested the faulty data word, and this program or routine may be halted while other programs continue running.
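The behavior of FIGS. 1A-B may be illustrated with a minimal Python sketch. It assumes one particular SECDED construction, an extended Hamming code with 7 check bits plus one overall parity bit, which gives the 8 ECC bits per 64-bit data word mentioned above. The bit layout and the names encode and decode are illustrative assumptions and are not taken from the figures.

```python
# Extended Hamming (72,64) sketch: positions 1..71 hold data and check bits,
# position 0 holds an overall parity bit. The layout is an assumption.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def _syndrome(bits):
    """XOR together the positions (1..71) of all bits that are set."""
    s = 0
    for pos in range(1, 72):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Return a 72-entry list of 0/1 values for a 64-bit data word."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)
    for p in PARITY_POSITIONS:
        bits[p] = 1 if s & p else 0      # forces the syndrome to zero
    bits[0] = sum(bits) & 1              # makes the total number of 1s even
    return bits


def decode(bits):
    """Return (status, data); status is 'ok', 'corrected', or 'detected'."""
    bits = list(bits)
    s = _syndrome(bits)
    odd = sum(bits) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd:
        # Odd overall parity: treat as a single-bit error (FIG. 1A case).
        if s > 71:
            return "detected", None      # syndrome names no valid position
        bits[s] ^= 1                     # s == 0 means the parity bit itself
        status = "corrected"
    else:
        # Even overall parity with a non-zero syndrome: double-bit error
        # whose location is unknown (FIG. 1B case).
        return "detected", None
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return status, data
```

In this sketch, flipping any one of the 72 stored bits yields the status 'corrected' together with the original data word, while flipping any two bits yields 'detected' with no data returned.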
FIG. 2 shows an instruction cache with ECC. ECC is also being used to detect errors in cache memories. Instruction cache 10 has data field 12 and ECC field 14 that contain data and associated ECC bits for cache lines. Valid bits 16 are set when valid data is written into a cache line of instruction cache 10, and cleared when a cache line is invalidated, such as during initialization or due to snooping.
When an error is detected in a cache line, ECC fields 14 may be used to try to correct the error, as shown for FIG. 1A. Alternatively, when an error is detected, the cache line may be invalidated, or the cache data may be refetched from main memory 18. Since instruction cache 10 contains only instructions, the processor never writes to instruction cache 10. Thus a back-up of all data in cache 10 is available in main memory 18.
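The invalidate-and-refetch alternative may be sketched as follows. This is a simplified model under stated assumptions: a single even-parity bit stands in for the ECC field as a detection-only code, main memory is modeled as a dictionary, and the class and method names are hypothetical.

```python
class InstructionCache:
    """Toy model of a read-only cache backed by main memory."""

    def __init__(self, main_memory):
        self.main_memory = main_memory   # address -> instruction word (back-up copy)
        self.lines = {}                  # address -> {"data", "check", "valid"}

    @staticmethod
    def check_bits(data):
        # Stand-in detection code (assumption): one even-parity bit per word.
        return bin(data).count("1") & 1

    def fill(self, addr):
        """Load a line from main memory and set its valid bit."""
        data = self.main_memory[addr]
        self.lines[addr] = {"data": data,
                            "check": self.check_bits(data),
                            "valid": True}

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or not line["valid"]:
            self.fill(addr)                      # ordinary miss
        elif self.check_bits(line["data"]) != line["check"]:
            line["valid"] = False                # invalidate the faulty line
            self.fill(addr)                      # refetch the clean copy
        return self.lines[addr]["data"]
```

Because the processor never writes to the cache in this model, a refetch always restores the correct instruction word, so no correction capability is needed for recovery.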
The ECC code used may be adjusted to trade off correction and detection capabilities. For example, rather than use a SECDED code that corrects 1-bit errors and detects 2-bit errors, an ECC code that detects 3-bit errors but cannot correct any errors may be used. This is especially useful for radiation-induced soft errors that can alter several adjacent memory cells at the same time. As memory densities increase, the number of bits altered by a single radiation event can increase.
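The trade-off between correction and detection follows from the minimum Hamming distance d of the code: a code can correct up to t bit errors while detecting up to s bit errors (with s at least t) when d is at least t + s + 1. The short sketch below enumerates the available choices for a given distance; the function name tradeoffs is an illustrative assumption.

```python
def tradeoffs(min_distance):
    """List (correctable, detectable) error counts for a given minimum distance."""
    pairs = []
    for t in range(min_distance):
        s = min_distance - 1 - t     # most errors detectable while correcting t
        if s >= t:                   # detection must cover at least what is corrected
            pairs.append((t, s))
    return pairs
```

For a distance-4 code such as SECDED, tradeoffs(4) returns [(0, 3), (1, 2)]: the same check bits may be used either to detect up to 3-bit errors with no correction, or to correct 1-bit errors while detecting 2-bit errors.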
While ECC is useful with instruction caches, write-back caches are more problematic. Write-back caches may contain data that is written by the processor. A copy of the data in the cache line may not yet be available in the main memory when the processor writes directly to the cache and not directly to main memory. While using a 3-bit detect, 0-bit correct ECC code could be useful for an instruction cache, a write-back cache could benefit more from a correcting code such as SECDED, which detects 2-bit errors and corrects 1-bit errors.
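The difference between the two kinds of cache may be summarized by a small decision sketch. It assumes each cache line carries a dirty flag marking data not yet written back to main memory, and that an ECC check reports one of three outcomes; the function name and the status and action strings are hypothetical labels used only for illustration.

```python
def recovery_action(line_dirty, ecc_status):
    """Choose a recovery action for a cache line.

    ecc_status is 'ok', 'correctable', or 'detected' (error found but not
    correctable). line_dirty is True when main memory holds no current copy.
    """
    if ecc_status == "ok":
        return "use"
    if ecc_status == "correctable":
        return "correct"     # ECC repairs the line in place
    if not line_dirty:
        return "refetch"     # a clean line still has a back-up in main memory
    return "halt"            # only copy is faulty: halt system or program
```

In this sketch a detection-only code never reports 'correctable', so every detected error in a dirty line ends in 'halt', whereas a correcting code lets single-bit errors in dirty lines be repaired.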
What is desired is a cache system that includes ECC for error correction and detection. ECC for use with a write-back cache is desirable.