1. Technical Field
The present invention relates generally to data processing and, in particular, to error handling in a data processing system.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Cache memories are commonly utilized to temporarily buffer memory blocks that might be accessed by a processor, speeding up processing by reducing the access latency incurred in loading needed data and instructions from memory. In some multiprocessor (MP) systems, the cache hierarchy includes at least two levels. The level one (L1), or upper-level, cache is usually a private cache associated with a particular processor core and cannot be directly accessed by other cores in an MP system. Typically, in response to a memory access instruction such as a load or store instruction, the processor core first accesses the directory of the upper-level cache. If the requested memory block is not found in the upper-level cache, the processor core then accesses one or more lower-level caches (e.g., level two (L2) or level three (L3) caches) for the requested memory block.
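By way of illustration only, the lookup order described above can be modeled in software as follows. This is a minimal sketch, not a description of any particular hardware: the names `CacheLevel` and `load`, the dictionary-based directory, and the policy of filling every level on a miss are all assumptions introduced for the example.

```python
# Hypothetical software model of a multi-level cache lookup.
# All names and the fill policy are illustrative assumptions.

class CacheLevel:
    def __init__(self, name):
        self.name = name
        self.directory = {}  # block address -> buffered memory block

    def lookup(self, address):
        """Consult this level's directory; return (hit, data)."""
        if address in self.directory:
            return True, self.directory[address]
        return False, None


def load(address, levels, system_memory):
    """Search the upper-level cache first, then lower levels, then memory.

    `levels` is ordered from upper level (e.g., L1) to lower level (e.g., L3).
    Returns the data and the name of the level that supplied it.
    """
    for level in levels:
        hit, data = level.lookup(address)
        if hit:
            return data, level.name

    # Miss in every cache level: fetch from system memory.
    data = system_memory[address]

    # Assumed fill policy: install the block in every level.
    for level in levels:
        level.directory[address] = data
    return data, "memory"
```

In this model, a first access to a block is served from system memory and a repeated access to the same block is served from the upper-level cache, which is the latency benefit the hierarchy is intended to provide.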
Conventional cache hierarchies are subject to hard errors due to hardware failure and soft errors due to cosmic radiation and other transient electromagnetic events. The hard errors include “stuck bit” errors in which a single memory cell of a cache entry fails, thus causing a persistent stuck bit correctable error condition. The cache hardware typically contains error correction code (ECC) logic to correct such single-bit correctable error conditions.
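For purposes of illustration, the following sketch shows how a single-error-correcting code masks a stuck bit on every read. It uses a textbook Hamming(7,4) code chosen for brevity; it is an assumption of this example, not the code employed by any particular cache, and the function names are hypothetical.

```python
# Illustrative Hamming(7,4) single-error correction. Real cache ECC
# typically protects wider words; this code is an assumed stand-in.

def hamming74_encode(nibble):
    """Encode 4 data bits as 7 bits (positions 1..7, parity at 1, 2, 4)."""
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); a nonzero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the single erroneous bit
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome


def read_with_stuck_bit(stored, stuck_pos, stuck_value):
    """Model a failed memory cell that always reads back `stuck_value`."""
    bits = list(stored)
    bits[stuck_pos - 1] = stuck_value
    return bits
```

In this model, a read through a stuck cell still decodes to the stored data, and the syndrome is nonzero whenever the stuck value differs from the stored bit. A second erroneous bit in the same word would exceed the correction capability of such a code, which is why a persistent stuck bit is a concern.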
In some prior art systems, additional action is taken to prevent a stuck bit condition from devolving to an uncorrectable multi-bit error condition. In particular, the cache entry containing the stuck bit can be taken off-line by performing a “line delete.” In conventional systems, a “line delete” requires the cache hardware to record the specific entries for which a correctable error condition occurs in a software-monitored status register. Software monitors the status register, logs the entries recorded in the status register, runs a heuristic to determine if any of the specific entries has a stuck bit, and if so, issues a “line delete” command instructing the cache hardware to take the given entry off-line, for example, by setting a bit in the cache directory.
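The conventional software flow described above might be sketched as follows. This is an illustrative model only: the class names, the representation of the status register as a list of entry identifiers, and the simple count-based heuristic with a threshold of three are assumptions of the example rather than features of any particular prior art system.

```python
from collections import Counter

# Assumed heuristic: an entry reporting this many correctable errors
# is treated as having a stuck bit. The value is illustrative.
STUCK_BIT_THRESHOLD = 3


class CacheDirectory:
    """Hypothetical directory model with a per-entry 'deleted' marking."""

    def __init__(self):
        self.deleted = set()

    def line_delete(self, entry):
        # Models setting a bit in the cache directory to take the entry off-line.
        self.deleted.add(entry)

    def is_online(self, entry):
        return entry not in self.deleted


class LineDeleteMonitor:
    """Hypothetical software monitor for a correctable-error status register."""

    def __init__(self, directory, threshold=STUCK_BIT_THRESHOLD):
        self.directory = directory
        self.threshold = threshold
        self.error_log = Counter()  # entry -> correctable errors logged

    def poll(self, status_register):
        """Drain and log recorded entries; line-delete suspected stuck bits."""
        deleted_now = []
        while status_register:
            entry = status_register.pop(0)
            self.error_log[entry] += 1
            suspected = self.error_log[entry] >= self.threshold
            if suspected and self.directory.is_online(entry):
                self.directory.line_delete(entry)
                deleted_now.append(entry)
        return deleted_now
```

Here an entry is identified by a hypothetical (set, way) tuple. Even in this simplified form, the hardware must report the precise entry for every correctable error, and the software must maintain a log and a heuristic for each such entry.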
Tracking cache accesses and reporting to software the precise cache entries in which all correctable errors occur is expensive in terms of the amount of required hardware. Further, the software required to monitor for, detect, and then address frequently occurring correctable errors through line delete actions is difficult to code correctly and to test.