The present disclosure relates generally to error handling in a cache memory and, in particular, to a method, system, and computer program product for handling errors in a cache memory without processor core recovery.
In a microprocessor system, various techniques are used to prevent data stored in caches from being consumed after a bit-flip, e.g., an error caused by cosmic rays or alpha particles. Techniques used to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity, where the bits of a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. For example, with even parity, a data word with an even number of 1's will have a parity bit of 0 and a data word with an odd number of 1's will have a parity bit of 1, with this parity bit appended to the stored memory data. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
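The even-parity scheme described above can be illustrated with a minimal sketch. The function names `even_parity_bit`, `store_word`, and `check_word` are hypothetical and chosen only for illustration; they do not correspond to any particular hardware implementation.

```python
def even_parity_bit(word):
    """XOR all bits of `word` together: 0 for an even number of 1's, 1 for odd."""
    parity = 0
    while word:
        parity ^= word & 1   # fold the lowest bit into the running XOR
        word >>= 1
    return parity


def store_word(word):
    """Return the word together with its generated parity bit (illustrative)."""
    return word, even_parity_bit(word)


def check_word(word, stored_parity):
    """Regenerate parity on read and compare it with the stored parity."""
    return even_parity_bit(word) == stored_parity


# A single flipped bit changes the regenerated parity and is detected.
data, p = store_word(0b1011)
corrupted = data ^ 0b0100
assert check_word(data, p)
assert not check_word(corrupted, p)
```

Note that a single parity bit only detects an odd number of flipped bits; two simultaneous flips leave the regenerated parity unchanged and go unnoticed.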
It was soon discovered that this parity technique could be extended not only to detect errors but also to correct them, by appending an error correction code (ECC) field, itself built from XOR operations, to each code word. Each bit of the ECC field is the XOR of a different subset of the bits in the word, so that errors (small changes to the data word) can be easily detected, pinpointed and corrected.
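One well-known instance of such a code is the Hamming(7,4) code, used here purely as an illustration; the disclosure does not state which ECC a given cache employs. In the sketch below, the function names are hypothetical. Three check bits, each the XOR of a different subset of the four data bits, allow any single flipped bit in the seven-bit code word to be located and corrected.

```python
def hamming74_encode(data):
    """Encode a 4-bit value into a 7-bit Hamming code word.

    The returned list is indexed by code-word position 1..7 (index 0 is
    unused), with check bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = [(data >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [0, p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code):
    """Return (data, syndrome); a nonzero syndrome is the corrected position."""
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    fixed = list(code)
    if syndrome:
        fixed[syndrome] ^= 1   # flip the bit the syndrome points at
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, syndrome


# Flip position 6 of an encoded word and recover the original data.
word = hamming74_encode(0b1001)
word[6] ^= 1
assert hamming74_correct(word) == (0b1001, 6)
```

This particular code corrects single-bit errors only; a double-bit error produces a nonzero syndrome that points at the wrong position, which is why practical designs often add an extra overall parity bit to detect that case.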
These error detection and error correction techniques are commonly used to restore data to its original/correct form in noisy communication transmission media or in storage media where there is a finite probability of data errors due to the physical characteristics of the device. Memory devices such as RAM generally store data as voltage levels representing a 1 or a 0 and are subject to both device failure and state changes due to high-energy cosmic rays and alpha particles. Similarly, hard disk drives, which store 1's and 0's as magnetic fields on a magnetic surface, are subject to imperfections in the magnetic media and other mechanisms that can cause changes in the data pattern from what was originally stored.
When data or instructions are read out of a cache, they are checked to determine whether they are corrupted. If corruption is detected, the underlying processor should not use the corrupt data, and the corrupt data should be removed from the cache and replaced by good data before the processor continues processing.
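The check-on-read behavior described above can be sketched with a toy software model. The `ToyCache` class, its methods, and the assumption that a good copy of every line is available in a backing store are all hypothetical simplifications made for illustration; they are not the mechanism of any particular processor.

```python
def parity(word):
    """Even parity: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1


class ToyCache:
    """Illustrative cache that stores a parity bit alongside each word."""

    def __init__(self, backing):
        self.backing = backing   # address -> word; assumed to hold good data
        self.lines = {}          # address -> (word, stored parity)

    def fill(self, addr):
        word = self.backing[addr]
        self.lines[addr] = (word, parity(word))

    def read(self, addr):
        if addr not in self.lines:
            self.fill(addr)
        word, stored = self.lines[addr]
        if parity(word) != stored:
            # Corrupt entry: do not return it; remove it and refetch good data.
            del self.lines[addr]
            self.fill(addr)
            word, _ = self.lines[addr]
        return word


cache = ToyCache({0x10: 0b1011})
cache.read(0x10)
good, p = cache.lines[0x10]
cache.lines[0x10] = (good ^ 0b0100, p)   # simulate a bit-flip in the cache
assert cache.read(0x10) == 0b1011        # corrupt word is never returned
```

The model assumes the cached line is clean, i.e., that the backing store still holds a valid copy; a line that has been modified only in the cache could not be restored this way.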
In typical machines, when a parity error is detected in a level 1 (L1) cache of a processor, the processor goes through recovery operations. This may entail signaling the processor to flush the execution pipeline (thereby preventing the corrupt data from being used), followed by resetting all state in the processor (e.g., finite state machines (FSMs), cache contents, queue contents, etc.) to its initial values. In particular, all cache entries are removed, including the corrupt entry. Going through recovery sometimes leads to the termination of the software program currently running on the processor.
If an uncorrectable parity or ECC error is detected in a level 2 (L2) cache line, the processor that fetched the line goes through recovery in a manner similar to that described above with respect to the L1 cache, and while the processor is recovering, the L2 cache removes that particular line.
These recovery operations increase the latency of the system, as the recovery activities must complete before the processor is able to continue with its tasks. It would be desirable to provide a means to handle these errors (e.g., parity or ECC) in a cache without performing a processor core recovery operation.