An information store such as a cache can be used to reduce the time that a processor takes to access memory for instructions and data. A cache is a smaller but faster memory that stores copies of information from the most frequently used main memory locations. Nearly all microprocessor systems employ cache memories for this performance benefit. Typically, cache memories consist of a Level 1 (L1) cache, which is relatively small and internal to the processor itself, and a larger Level 2 (L2) cache, which is implemented using external Synchronous Static Random Access Memory (SSRAM) devices that are not Error Correction Coding (ECC) protected.
Single- or multi-bit errors that affect L2 cache memory contents may arise in any of various scenarios. Errors may occur, for example, during transfer of data or instructions (writes) from a Central Processing Unit (CPU) main memory to the cache, during transfer of data or instructions (reads) from the cache to the CPU, or during modification of data in the cache as instructions are executed. Cache memory data or instruction contents can also be corrupted due to soft errors, firm errors, and/or hard errors while information is stored in an external memory device.
The causes of data or instruction corruption in CPU to L2 cache memory systems may include, for example, any or all of: marginal timing variations that occur naturally in a design due to component and/or manufacturing differences affecting the operational characteristics of components; memory or other component manufacturing defects that cause intermittent “glitches” in a system under a specific set of conditions; and soft errors due to external phenomena such as cosmic rays.
SSRAM devices, which are often used to implement cache memories, tend to be susceptible to a number of factors such as temperature, humidity, noise, and, for electronic card-based implementations, the equipment slot in which a card is installed. Another source of errors is ionizing radiation, such as cosmic rays, that occurs naturally in the environment. The density of SSRAM memory cells is such that if a cell is struck by a high-energy particle, the bit value stored in that cell can be changed, an effect known as bit flipping.
As noted above, external L2 cache memories are not normally ECC protected. Even if ECC protection were provided for an L2 cache, the issue of error handling would not be completely solved, since ECC schemes have limited error correction capabilities. Therefore, in external L2 cache applications, single-bit errors and most multi-bit errors in the L2 cache memory are detected at the CPU as parity errors.
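The following is a minimal sketch of how parity checking detects such errors, and why only "most" multi-bit errors are caught. It assumes one even-parity bit per stored byte, a granularity that the text above does not specify; the function names `parity_bit`, `store`, `check`, and `flip_bits` are hypothetical and chosen for illustration only.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte):
    """Model a stored byte as a (data, parity) pair (assumed layout)."""
    return (byte & 0xFF, parity_bit(byte))


def check(stored):
    """Return True if the stored parity bit still matches the data."""
    data, parity = stored
    return parity_bit(data) == parity


def flip_bits(stored, positions):
    """Simulate corruption by flipping the given data bit positions."""
    data, parity = stored
    for pos in positions:
        data ^= 1 << pos
    return (data, parity)


word = store(0b10110010)
print(check(word))                     # intact byte passes the check
print(check(flip_bits(word, [3])))     # single-bit flip is detected
print(check(flip_bits(word, [3, 5])))  # two flips in one byte cancel out
```

Under this assumption, any odd number of flipped bits within a byte is reported as a parity error, while an even number of flips within the same byte goes unnoticed. Multi-bit errors spread across different bytes are still detected, because each affected byte fails its own check.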
Errors and corruption are generally considered serious enough to halt execution of the CPU entirely so as to eliminate the risk of processing a “bad instruction” or proceeding with processing based on corrupted data. The CPU is then reset as a result of the error/corruption. However, if the error/corruption does not affect data that has been modified only in the cache (i.e., data that has not been synchronized between the cache and the main memory), a valid copy of the affected information remains in the main memory. In that case this simple response is excessive and may cause a long and unnecessary interruption in the services provided by the CPU.
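The distinction drawn above, between information that exists only in the cache and information that is still synchronized with main memory, can be sketched as a simple decision. This is an illustration of the reasoning only, not a mechanism described in this document; the `CacheLine` structure, its `dirty` flag, and the `Action` names are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INVALIDATE_AND_REFETCH = "invalidate_and_refetch"  # hypothetical lighter response
    RESET_CPU = "reset_cpu"                            # the conventional response


@dataclass
class CacheLine:
    tag: int
    dirty: bool  # assumed flag: True if modified in cache and not yet written back


def choose_recovery(line):
    """Pick a response to a parity error on the given line.

    A dirty line holds the only up-to-date copy, so the corrupted contents
    cannot be restored from main memory. A clean line still has a valid
    copy in main memory and could, in principle, be reloaded.
    """
    if line.dirty:
        return Action.RESET_CPU
    return Action.INVALIDATE_AND_REFETCH


print(choose_recovery(CacheLine(tag=0x1A, dirty=False)))
print(choose_recovery(CacheLine(tag=0x2B, dirty=True)))
```

The sketch makes the cost of the blanket reset visible: every error on a clean line takes the same disruptive path as an error on a dirty line, even though only the latter involves information that cannot be recovered from main memory.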
Traditional approaches for responding to or correcting L2 cache errors include detecting an error in software and triggering a system crash. A crash requires a complete reset to recover from the detected error and can result in a significant disruption in a software application or the operation of a communication network, for example. Some systems may employ software processes that periodically “flush out” cache memories to main memory during idle times to mitigate the effect of soft errors on stored data. Hardware-based approaches that use ECC for detecting and correcting single-bit errors can mitigate the impact of errors or corruption, but do not eliminate the problem, in that multi-bit errors can be detected but not corrected. ECC-based error checking also tends to be slow.
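The limitation of ECC noted above can be illustrated with a small single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming(8,4) code over a 4-bit value purely as an example; it is an assumption for illustration, since the text does not name a particular ECC scheme, and real memory ECC operates on much wider words.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1            # makes the total number of ones even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) & 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd overall parity: treated as a single-bit error at `syndrome`
        # (position 0, the overall parity bit itself, when the syndrome is 0).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return ("uncorrectable", None)
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return (status, nibble)


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

In this sketch a single flipped bit is located and repaired, whereas two flipped bits are only reported. Three or more flipped bits can be mistaken for a single-bit error and miscorrected, which is one reason ECC alone does not eliminate the problem.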
Thus, there remains a need for improved information error recovery mechanisms.