1. Field of the Invention
The present invention pertains to the field of cache memories. More particularly, the present invention pertains to the field of handling errors caused by the loss of information stored in cache memory cells.
2. Description of Related Art
As computer systems are increasingly relied on for business, technical, and personal tasks, the importance of computer system reliability also increases. In particular, complex multiple processor systems are often used in applications in which reliable operation is highly desirable if not crucial. Unfortunately, the risk of a system failure due to a memory error increases as more and more processors and cache memories are added to the system. Thus, it would be advantageous to develop techniques to limit the adverse impact of errors in multiple processor systems and/or multiple cache memory systems.
One particular type of memory failure which is exacerbated by continuing advances in computer systems is that of cache data corruption due to "soft errors." As opposed to "hard errors," which result from design flaws, "soft errors" typically occur due to extraterrestrial radiation (e.g., cosmic rays) or alpha particle radiation from packaging materials. Thus, while a hard error can be repeated by subjecting the computer system to the exact same operating sequence, soft errors are more like random events, and thus are inherently unpredictable.
One major source of soft errors is the impingement of radiation such as alpha particles upon a semiconductor device. In particular, memory cells are very susceptible to alpha particle radiation because the impinging alpha particle may reverse the charge used to store data within the memory cell. When such a soft error is caused in a memory cell being used to store data or instructions for a program, a system failure may occur if error detection and containment or recovery is not performed.
As technological advances allow smaller and smaller memory devices, the likelihood of a soft error increases for two reasons. First, a smaller memory cell stores less charge and accordingly is more easily discharged or reversed by the impact of an alpha particle. Second, the decreasing cell size allows arrays of larger numbers of cells, further increasing the number of possible error locations. Thus, both the increasing number of memory cells and the decreasing charge stored in each cell aggravate the soft error problem.
Since modern high power computer systems extensively use cache memories to increase system performance, the detection and correction of soft errors in system cache memories has become increasingly important. Furthermore, since many processors include integrated cache memories, the use of multiple processors in a system heightens the risk of memory errors which may adversely affect system operation. Thus, a technique which effectively deals with soft errors in a multi-processing system without disrupting system operation could increase reliability of such systems.
One prior art approach to dealing with cache errors in a multi-processing system is exemplified by the technique used in the Pentium® processor line available from Intel Corporation of Santa Clara, Calif. Intel Pentium® II processors may enable an internally generated machine check exception to deal with certain cache snoop errors which are caused by cache data corruption. The machine check exception causes the processor to run routines which may determine whether or not it is possible to recover from the snoop error. An error may be recoverable if the cache line was in any state other than modified, since unmodified cached data can be found elsewhere in the system.
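The recoverability test described above can be illustrated with a minimal sketch. It assumes a MESI-style set of cache line states; only the "modified" state is named in this description, and the other state names and the function name are hypothetical.

```python
from enum import Enum


class LineState(Enum):
    # Assumed MESI-style states; only "modified" is named in the text.
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop_error_recoverable(state: LineState) -> bool:
    """Return True if a corrupted line in this state may be recoverable.

    An unmodified line has a valid copy elsewhere in the system, so the
    corrupted copy can be discarded. A modified line may hold the only
    up-to-date copy of the data.
    """
    return state is not LineState.MODIFIED
```

Under these assumptions, a machine check routine would attempt recovery for a shared or exclusive line and treat a modified line as unrecoverable.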
Some prior art systems employ data integrity tests for the data stored within the cache data entries. An error correction code may be used to detect not only when an entry has been corrupted, but also how that corrupted entry might be restored. Such data recovery requires substantial overhead to store sufficient additional error checking bits to reconstruct the corrupted data and to perform the correction.
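As an illustration of the overhead involved, the following sketch uses a Hamming(7,4) code, chosen here only as a well-known example and not as the code used by any particular prior art system. Three check bits are stored for every four data bits so that any single flipped bit can be located and corrected. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (0/1) into a 7-bit codeword, positions 1..7.

    Check bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d3, d5, d6, d7 = data_bits
    p1 = d3 ^ d5 ^ d7  # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7  # covers positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7  # covers positions 4, 5, 6, 7
    return [p1, p2, d3, p4, d5, d6, d7]


def hamming74_correct(codeword):
    """Return (data_bits, error_position) for a 7-bit codeword.

    error_position is the 1-based position of the corrected bit,
    or 0 if no error was detected. Only single-bit errors are handled.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], syndrome


# Usage: simulate a soft error by flipping one stored bit.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # corrupt position 5
recovered, position = hamming74_correct(stored)
```

In this sketch the check bits add 75 percent to the storage for each data entry, in addition to the logic needed to compute the syndrome on every access.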
Cache tag arrays are often much smaller than the actual cache data arrays and consequently may be less likely to become corrupted. A tag array failure, however, prevents a cache memory from accurately determining whether the cache contains a particular address. Thus, in prior art systems, cache snoop cycles cannot be properly performed if the tags are corrupted. Since cache snoop cycles are crucial in most systems to maintaining system cache coherency, the inability to handle snoop errors without disrupting system operation may be detrimental to overall system performance.
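The effect of a corrupted tag on a snoop can be sketched as follows. The sketch assumes a single even-parity bit stored with each tag, which is one simple integrity check and is not stated in this description; the class and function names are hypothetical.

```python
def parity(value: int) -> int:
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


class TagEntry:
    """A cache tag stored together with an assumed parity bit."""

    def __init__(self, tag: int):
        self.tag = tag
        self.parity = parity(tag)


def snoop_lookup(entry: TagEntry, snoop_tag: int) -> str:
    """Return 'hit', 'miss', or 'error' for a snooped address tag.

    If the stored tag fails its parity check, neither a hit nor a miss
    can be trusted, so the lookup reports an error instead.
    """
    if parity(entry.tag) != entry.parity:
        return "error"
    return "hit" if entry.tag == snoop_tag else "miss"


# Usage: a single flipped bit turns tag 0b1011 into 0b1111.
entry = TagEntry(0b1011)
before = snoop_lookup(entry, 0b1011)
entry.tag ^= 0b0100  # simulated soft error in the tag array
after_original = snoop_lookup(entry, 0b1011)
after_aliased = snoop_lookup(entry, 0b1111)
```

Without the integrity check, the corrupted entry would report a false miss for the original address and a false hit for the aliased address, either of which could break cache coherency.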
One reason for the inability of prior art systems to handle cache snoop errors in a manner which limits system disruption is that prior art systems do not report snoop errors to the entire system during the snoop bus cycle. Not reporting snoop errors to other caches or cache control logic in the system prevents the snooping bus agent from making a decision whether the snoop error can be ignored at least temporarily, and at least by the snooping bus agent. Indeed, there are situations where interrupting system operation (e.g., using the mentioned machine check interrupt) may unnecessarily slow system operation because the interrupt occurs regardless of whether other bus agents in the system could immediately provide the requested information without resolving the corrupt data problem.
For example, if a snooping bus agent encounters corrupt data in a first cache, but a second system cache contains a valid copy of the data, the snooping bus agent could obtain the valid data from the second cache memory and continue operating. Thus, the snooping bus agent could be improved to at least temporarily ignore the error during the snoop cycle which accessed the corrupt data; however, the first cache may also need to respond to the detection of the corrupt entry.
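The decision available to a snooping bus agent that can see every cache's snoop result may be sketched as follows. The status strings, the action names, and the fallback behavior when no valid copy is found are all assumptions made for illustration and do not describe any particular bus protocol.

```python
def resolve_snoop(responses):
    """Decide how a snooping bus agent proceeds.

    responses is a list of (status, data) pairs, one per snooped cache,
    where status is 'hit', 'miss', or 'error' (hypothetical encoding).
    Returns an (action, data) pair.
    """
    # A valid copy in any cache lets the agent continue immediately,
    # even if another cache reported corrupt data for the same line.
    for status, data in responses:
        if status == "hit":
            return ("use_cached_copy", data)
    # No valid cached copy: a reported error cannot be ignored, because
    # the corrupt entry might have held the only up-to-date data.
    if any(status == "error" for status, _ in responses):
        return ("needs_recovery", None)
    return ("fetch_from_memory", None)


# Usage: the first cache is corrupt, the second holds a valid copy.
result = resolve_snoop([("error", None), ("hit", 0xCAFE)])
```

In this sketch the snooping agent continues with the valid copy, while the cache that reported the error still has to deal with its own corrupt entry separately.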
Unfortunately, prior art systems which do not communicate snoop errors at the system level cannot determine whether the corrupt data can be ignored by the snooping bus agent. Furthermore, prior art systems do not specifically deal with corrupted tag entries at a system level. This results, in some cases, in a routine (e.g., a machine check routine) being run in an attempt to resolve the corrupted cache entry problem prior to the time at which it is actually required. In other cases, it may be more than a matter of merely delaying an inevitable error recovery routine since the cache with a corrupt entry may be flushed or the corrupt entry replaced, altogether obviating the need to run the recovery routine.
Thus, the prior art does not provide a system which adequately allows cache snooping errors to be handled in a non-disruptive manner. Containing and recovering from memory errors such as snoop errors is becoming increasingly important as increasing memory sizes, decreasing device sizes, and increasing levels of multiprocessing elevate the overall risk of a memory error somewhere in the computer system. Thus, it would be advantageous to reduce the disruption of the operation of a system which encounters an error while snooping other system caches.