One or more aspects of the invention relate, in general, to computer memory, and in particular, to managing computer memory to mitigate effects of a fault of the computer memory.
Computer systems often require a considerable amount of high-speed memory, such as random access memory (RAM), to hold information, including operating system software, virtual machine images, application programs and other data, while a computer is powered on and operational. This information is normally binary, composed of patterns of 1s and 0s, known as bits of data. This binary information is typically loaded into RAM from nonvolatile storage, such as hard disk drives (HDDs), during power-on and initial program load (IPL) of the computer system.
Computer RAM is often designed with pluggable modules so that incremental amounts can be added to each computer as dictated by the specific memory requirements for each system and application. One example of such a pluggable module is the Dual In-Line Memory Module (DIMM), which is a thin rectangular card with several memory chips mounted on it. DIMMs are often designed with dynamic random access memory (DRAM) chips that must be regularly refreshed to prevent the data they are holding from being lost. Originally, DRAM chips were asynchronous devices, but newer chips, known as synchronous dynamic random access memory (SDRAM), have synchronous interfaces to improve performance. Eventually, Double Data Rate (DDR) devices emerged that use pre-fetching, along with other speed enhancements, to improve memory bandwidth and reduce latency.
The size of RAM has continued to grow as computer systems have become more powerful. Currently, it is not uncommon to have a single computer RAM composed of hundreds of trillions of bits. The failure of just a single RAM bit can cause the entire computer system to fail. When hard errors occur, whether single-cell, multi-bit, full-chip or full-DIMM failures, all or part of the system RAM may remain down until it is repaired. The repair can take hours or even days, which can have a substantial impact on a business dependent on the computer system.