Cache memories have become integral to computer systems. They provide an important performance benefit through minimizing the number of times that it is necessary to read and/or write slower auxiliary memories (such as DRAM) across even slower memory busses. They typically operate at or near the same speed as the processor that they are supporting, and indeed, are often integrated into processors. They also tend to be constructed utilizing the same technologies and feature sizes as the processors.
However, the feature size for cache memories continues to shrink while the speed at which they are required to operate continues to climb, along with that of the processors. As such, the potential for bit errors increases. Meanwhile, the requirement for fault-free operation continues to increase for mission-critical and large-scale computer systems.
One problem that exists for cache memories, probably more than for any other portion of a computer system, is that bit errors can be extremely harmful to the operation of the entire computer system. Many bit errors detected during processor operation can be recovered from, for example by notifying or aborting the task or job currently executing. Auxiliary memory (such as DRAM) can utilize Error Correction Codes (ECC) that allow automatic single bit correction and detection of most multiple bit errors.
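The single-bit correction and multiple-bit detection mentioned above can be illustrated with a toy code. The following is a minimal sketch, assuming an extended Hamming (8,4) code over a 4-bit value. Actual DRAM ECC uses much wider codewords, and the function names `encode` and `decode` are hypothetical, chosen only for this illustration.

```python
# Toy SECDED (single error correct, double error detect) sketch using an
# extended Hamming (8,4) code. Assumption: illustration only, not the
# layout of any particular memory controller.

DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry check bits


def encode(nibble):
    """Encode a 4-bit value into 8 bits; index 0 holds overall parity."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must fit in 4 bits")
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 8):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(received):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    code = list(received)        # work on a copy, leave the input untouched
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos      # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd overall parity: treat as a single-bit error and correct it.
        code[syndrome] ^= 1      # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "double error detected"
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        nibble |= code[pos] << i
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # inject a single-bit error
    print(decode(word))
    word[2] ^= 1                 # inject a second error
    print(decode(word))
```

Even in this toy form, decoding requires computing a syndrome and conditionally flipping a bit before the data can be used, which is extra work on every read.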
Cache memories, on the other hand, are required to operate at much higher speeds than auxiliary memories. The speed difference may be 5× or even 10× with today's technologies. ECC is thus not realistic for cache memories, since detecting and correcting errors would invariably require extra cache memory cycles.
One reason that cache memory bit failures can be so catastrophic to a computer system is that when an error occurs, and if it is detected, it is sometimes not possible (or extremely hard and expensive) to determine the state of the memory of the computer system. For example, if an error is detected in a cache tag, it is not directly possible to determine which auxiliary (DRAM) memory block corresponds to that cache tag. With a 14-bit cache tag and a single-bit error, any one of the 14 bits may be the one that flipped, so potentially 14 different blocks of auxiliary memory may be implicated. If the cache memory has ownership of that block of memory, then any one of the potential 14 blocks of auxiliary memory may or may not be valid. Since it is impractical to determine which block of memory is implicated, it is difficult, if not infeasible, to terminate only the job or task running in that memory. The only realistic result in some situations then is to reinitialize any processors that may be using that cache memory in order to guarantee that the job or task executing in that memory is terminated.
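The ambiguity described above can be made concrete with a minimal sketch. It assumes that the error is known to be a single flipped bit in a 14-bit tag. The set-index and block-offset widths (8 and 6 bits), the sample values, and the function names are hypothetical and chosen only for illustration.

```python
# Sketch: enumerate the blocks implicated by a single-bit error in a cache tag.

TAG_BITS = 14     # tag width from the example in the text
INDEX_BITS = 8    # assumption: hypothetical set-index width
OFFSET_BITS = 6   # assumption: hypothetical block-offset width (64-byte blocks)


def candidate_tags(observed_tag, tag_bits=TAG_BITS):
    """Return every tag that differs from the observed tag in exactly one bit."""
    if not 0 <= observed_tag < (1 << tag_bits):
        raise ValueError("tag does not fit in the given width")
    return [observed_tag ^ (1 << bit) for bit in range(tag_bits)]


def candidate_block_addresses(observed_tag, set_index):
    """Rebuild the block address that each candidate tag would map to."""
    if not 0 <= set_index < (1 << INDEX_BITS):
        raise ValueError("set index does not fit in the given width")
    return [
        ((tag << INDEX_BITS) | set_index) << OFFSET_BITS
        for tag in candidate_tags(observed_tag)
    ]


if __name__ == "__main__":
    # Hypothetical corrupted tag and set index.
    for address in candidate_block_addresses(0x1ABC, 0x2F):
        print(f"{address:#010x}")
```

Each of the 14 addresses printed is an equally plausible owner of the cache line, which is why the corrupted tag alone cannot identify the affected block.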
It would thus be advantageous to have available a mechanism to efficiently detect and compensate for any cache memory address tag bit errors.