Computer memory is subject to errors caused by chip failure and by ionizing radiation. Chip failure can result from manufacturing defects, voltage spikes, or combinations thereof. Randomly occurring memory errors caused by ionizing radiation are generally referred to as “soft errors.” Various error correction codes that detect and correct soft errors are known and in use. One well-known error correction code is the Hamming code, published in 1950 by Richard Hamming. Error correction codes work by appending additional data to a data segment, wherein the additional data contains sufficient information to detect and/or correct one or more errors in the data segment.
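As a rough illustration of how appended check data allows an error to be located and corrected, the following sketch implements a Hamming(7,4) code, in which three parity bits are added to four data bits. The function names and the list-of-bits representation are choices made for this illustration only and are not drawn from any particular memory system.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are numbered 1..7. Parity bits occupy
    positions 1, 2 and 4; data bits occupy positions 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, error_position).

    error_position is the 1-based position of the corrected bit,
    or 0 if the parity checks found no error.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # recheck positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # recheck positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of a single error
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome


codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1  # simulate a soft error at position 5
data, position = hamming74_decode(codeword)
```

In this sketch the decoder recovers the original four data bits whenever at most one of the seven stored bits has flipped. Two flipped bits in the same codeword are beyond what this particular code can correct.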
In computing systems, data is stored in main memory, which generally comprises a plurality of memory chips that are accessed in parallel. Thus, reading 32 contiguous bits of data from memory in a single read operation could entail reading data from as many as 32 memory chips, with one bit being read from each chip. When one chip fails repeatedly, the corresponding bit in each read operation is frequently erroneous. While that bit can generally be corrected using the error correction code applied to the data, the persistent error degrades the effectiveness of the error correction: correction capacity spent on the failing chip is not available for soft errors occurring in the same read, so such soft errors may go uncorrected, which in turn leads to instability of the system.
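The interaction between a failing chip and a soft error can be sketched with a toy model of a parallel read. The stuck-at failure model, the function names, and the parameters below are assumptions made for illustration and do not describe any specific hardware.

```python
WORD_BITS = 32  # one bit per chip, matching the 32-chip example in the text


def read_word(stored, stuck_chip=None, stuck_value=0, soft_error_bit=None):
    """Model one parallel read in which bit i of the word comes from chip i.

    stuck_chip: index of a failed chip that always returns stuck_value
                (a hypothetical stuck-at failure model).
    soft_error_bit: index of a bit flipped by a one-off soft error, if any.
    """
    bits = list(stored)
    if soft_error_bit is not None:
        bits[soft_error_bit] ^= 1  # transient flip
    if stuck_chip is not None:
        bits[stuck_chip] = stuck_value  # failed chip overrides its bit
    return bits


def count_bit_errors(stored, observed):
    """Number of bit positions where the read differs from what was stored."""
    return sum(s != o for s, o in zip(stored, observed))


stored = [1] * WORD_BITS
chip_only = count_bit_errors(stored, read_word(stored, stuck_chip=5))
chip_and_soft = count_bit_errors(
    stored, read_word(stored, stuck_chip=5, soft_error_bit=12)
)
print(chip_only)      # 1
print(chip_and_soft)  # 2
```

Assuming a code that can correct only one bit error per word, the first read in this model remains correctable, while the second, in which a soft error coincides with the failed chip, does not.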
Previous attempts at resolving this issue have generally revolved around providing redundant or back-up memory devices. For example, a memory board may be kept on stand-by and activated, when a bad chip is detected, by copying to it the data from the failing memory board. It is also known to kill a single chip and remap its memory to a stand-by or other chip using software or a hardware memory controller. However, these previous systems were inefficient. Redundant systems required extra, otherwise unused memory boards to be present. Previous memory remapping required extensive rerouting and management of memory in the memory controller on the processor silicon, which consumed expensive real estate on the processor.
There is therefore an unmet need for an improved memory chip kill system and method that does not require excessive processor real estate, is simple to implement, and is transparent to the normal operation of the processor.