The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for firmware monitoring of memory scrub coverage and related firmware fault isolation techniques.
Memory scrubbing is the process of detecting and correcting bit errors in memory by using an error correction code (ECC). Due to the high integration density of contemporary computer memory chips, individual memory cell structures have become small enough to be vulnerable to cosmic rays and/or alpha particle emission. The errors caused by these phenomena are called soft errors. The probability of a soft error at any individual memory bit is very small. However, given the large amount of memory with which computers, especially servers, are equipped, and uptimes of perhaps several months, the probability of a soft error somewhere in the total installed memory becomes significant.
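The scaling argument above can be illustrated with a short calculation. The following is a minimal sketch, assuming that bit flips are independent and that each bit has the same small probability of flipping over the uptime considered; the function name `prob_any_soft_error` and the per-bit probability used in the example are hypothetical and are not measured values.

```python
import math


def prob_any_soft_error(per_bit_prob: float, num_bits: int) -> float:
    """Probability that at least one of num_bits independent bits flips.

    Computes 1 - (1 - p)**n using log1p/expm1 so that very small
    per-bit probabilities are not lost to floating-point rounding.
    """
    if not 0.0 <= per_bit_prob < 1.0:
        raise ValueError("per_bit_prob must be in [0, 1)")
    if num_bits < 0:
        raise ValueError("num_bits must be non-negative")
    return -math.expm1(num_bits * math.log1p(-per_bit_prob))


if __name__ == "__main__":
    # Hypothetical per-bit flip probability over the whole uptime (assumption).
    p = 1e-15
    for label, bits in (("1 GiB", 8 * 2**30), ("1 TiB", 8 * 2**40)):
        print(label, prob_any_soft_error(p, bits))
```

Under these assumptions, the result grows roughly in proportion to the number of bits while the per-bit probability stays fixed, which is why a negligible per-bit risk can become significant for a large server.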
The information in an ECC memory is stored redundantly enough to correct a single-bit error per memory word. An ECC memory can therefore support scrubbing of the memory content. If the memory controller scans systematically through the memory, single-bit errors can be detected, the erroneous bit can be determined using the ECC checksum, and the corrected data can be written back to the memory.
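The scan, locate, and write-back sequence described above can be sketched in software. The example below is a toy model only: it assumes a Hamming(7,4) code over 4-bit values in place of the wider codes used by real ECC memory, and it models memory as a Python list. All names (`encode`, `decode`, `syndrome`, `scrub`) are hypothetical and do not represent any memory controller interface.

```python
# Toy Hamming(7,4) code: bit positions are numbered 1..7, parity bits sit at
# positions 1, 2 and 4, and data bits sit at positions 3, 5, 6 and 7.
DATA_POSITIONS = (3, 5, 6, 7)
PARITY_POSITIONS = (1, 2, 4)


def syndrome(word: int) -> int:
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble: int) -> int:
    """Place 4 data bits, then set parity bits so the syndrome is zero."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for pos in PARITY_POSITIONS:
        if s & pos:
            word |= 1 << (pos - 1)
    return word


def decode(word: int) -> int:
    """Extract the 4 data bits from a codeword."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble


def scrub(memory: list[int]) -> int:
    """Scan every word, fix single-bit errors in place, return the fix count."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:  # a non-zero syndrome names the position of the flipped bit
            memory[addr] = word ^ (1 << (s - 1))
            corrected += 1
    return corrected


if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    memory[5] ^= 1 << 2  # inject a single-bit soft error
    print(scrub(memory), [decode(w) for w in memory])
```

In this toy code, two flipped bits in the same word produce a syndrome that points at a third, innocent bit, so the scrub "corrects" the word into wrong data. That behavior illustrates why each location needs to be scrubbed before a second error is likely to land in the same word.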
It is important to check each memory location periodically, before multiple bit errors within the same word are likely to occur, because with typical ECC memory modules one-bit errors can be corrected but multiple-bit errors cannot. Currently, there is no way to verify whether a hardware scrub (performed by maintenance logic in a memory controller) has gone through all memory addresses in a system during a memory scrub cycle. In practice, a hardware scrub may hang or get stuck in a loop due to various hardware and firmware defects. In these situations, firmware assumes that the hardware scrub is running without error and is still scrubbing the directed address ranges in memory.
Three critical problems can arise when a memory scrub fails to detect potentially fatal memory errors on a given rank, that is, when firmware believes that the hardware has scrubbed the rank but the scrub never actually completed through the rank because of some problem. First, failure of a scrub to fix soft single-cell errors means there is a risk of those errors accumulating over time and lining up to become an unrecoverable error (UE). Second, the scrub may fail to detect errors that could otherwise be repaired via chip/symbol marking or dynamic random access memory (DRAM) steering, resulting in UEs. Third, the scrub may fail to detect UEs that could otherwise be logical memory block (LMB) garded by virtualization to prevent an operating system (OS) or application crash.
Currently, if a scrub does not detect a fatal memory error, the most likely place for the error to be caught is when the virtualization layer, the operating system kernel, or an application tries to access the affected memory location. When that happens, the system is exposed to an application, operating system kernel, or virtualization layer crash. The only way to prevent the crash of the system is to enable memory mirroring for all memory locations.
However, memory mirroring is expensive in terms of implementation and of the resources used during mirrored operations. This added redundancy negatively impacts performance. In cases where only the virtualization layer is mirrored, the operating system kernel and applications are still vulnerable to undetected, potentially fatal memory errors.