The present invention relates to computer architecture, processing and memory systems, and more specifically to Recoverability, Availability and Serviceability (RAS) needs including efficient and selective sparing of bits in memory systems/subsystems.
With recent advancements in information technology and the wide use of the internet to store and process information, more and more demands are placed on the acquisition, processing, storage and dissemination of information by computing systems. Computing systems are being developed to increase the speed at which computers are able to execute increasingly complex applications for business, personal use, and entertainment. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processors, any memory caches, input/output (I/O) subsystems, the efficiency of the memory control functions, the performance of the memory devices and systems and any associated memory interface elements, and the type and structure of the memory interconnect interfaces.
The constantly increasing speed of processors which execute increasingly complex applications places more rigorous performance demands on all of the other subsystems in the computer, including the memory subsystem, where data is stored, accessed, and updated numerous times during the operation of an application. The time consumed by memory read/write operations is a major factor in the ultimate speed and efficiency of a computer system. The memory subsystem of most computers is normally operated by a memory controller. The task of memory controllers is to move data between the computer's memory subsystem and its one or more processors as quickly and efficiently as possible. In many memory subsystems, the memory controller may control multiple memory devices. The memory devices may be arranged in ranks and/or channels. A computer's memory subsystem often comprises memory modules, usually one or more dual in-line memory modules (DIMMs) that include several memory devices, e.g., dynamic random access memory (DRAM) devices. The DIMMs may have one or more ranks and channels of memory devices.
Computing demands require the ability to access an increasing number of higher density memory devices at faster and faster access speeds. Extensive research and development efforts are invested by the industry to create improved and/or innovative solutions to maximize overall system performance by improving the memory system/subsystem design and/or structure and the methods by which the memory system/subsystem operates. Such efforts have resulted in the development of distributed memory systems, distributed buffer memory systems, registered DIMMs (RDIMMs) and load reduced DIMMs (LRDIMMs), and other systems, specifications and standards such as, for example, DDR4 and DDR5, which provide for increased memory performance.
In one example, a distributed memory system may include a plurality of memory devices, one or more Address Chips (AC), also known as memory control circuits, and a plurality of data circuits, also known as data buffer circuits or DC chips (DC). There are communication links or buses between a host processor and both the memory control circuits and the data buffer circuits. There is also a communication link or bus from the memory control circuits to the data buffer circuits. There are also communication links between the memory devices, e.g., DRAMs, and both the memory control circuits and the data buffer circuits. Bandwidth limitations on these communication links can affect the performance of memory systems.
As the performance of memory systems increases (e.g., speed and capacity), recoverability, availability and serviceability (RAS) are also important considerations. The RAS needs of a high-end server or mainframe computer are very different from those of a low-end personal computer. In order to increase reliability and to prevent or at least lower the risk of computer failure, different forms of error detection and correction processes have been developed. One commonly used system for error detection is the use of parity bits. While parity checking detects single-bit errors (and, more generally, any odd number of bit errors), it does not detect multibit errors that flip an even number of bits, and parity checking systems have no mechanism to correct data errors.
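The limitation of parity checking noted above can be illustrated with a minimal sketch. It assumes a single even-parity bit protecting an 8-bit data word; the function names, the word width, and the example bit patterns are hypothetical and chosen only for illustration, not taken from any particular memory system.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for `word`: 1 if the count of set bits is odd, else 0."""
    return bin(word).count("1") % 2


def parity_check_passes(word: int, stored_parity: int) -> bool:
    """True when the recomputed parity matches the parity stored with the word."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit data word with four set bits, so its parity bit is 0.
data = 0b10110010
p = parity_bit(data)

# One flipped bit changes the number of set bits from even to odd.
single = data ^ 0b00000100
# Two flipped bits leave the parity of the set-bit count unchanged.
double = data ^ 0b00010100

single_detected = not parity_check_passes(single, p)
double_detected = not parity_check_passes(double, p)

print("single-bit error detected:", single_detected)
print("double-bit error detected:", double_detected)
```

In this sketch the single-bit error changes the recomputed parity and is flagged, while the double-bit error leaves the parity unchanged and passes the check. In neither case does the parity bit indicate which bit position failed, which is why parity alone cannot correct the data.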