The present invention relates generally to computer memory, and more particularly to providing sparing for a memory system.
Computer systems often require a considerable amount of high speed random access memory (RAM) to hold information, such as data and programs, temporarily when a computer is powered and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of eight bits; more generally these groups or bytes are called symbols and may consist of any number of bits or sub-symbols.
Memory device densities have continued to grow as computer systems have become more powerful. It is now not uncommon for the RAM of a single computer to be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. Memory errors may be “hard” (repeating) or “soft” (one-time or intermittent) failures, and may occur as single cell, multi-bit, full chip or full memory module failures. When they occur, all or part of the system RAM may be unusable until it is repaired. Repair turnaround times can be hours or even days, which can have a substantial impact on a business dependent on the computer systems.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow.
Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. These error detection and error correction techniques are commonly used to restore data to its original, correct form in noisy communication transmission media or in storage media where there is a finite probability of data errors due to the physical characteristics of the device. Memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high-energy cosmic rays and alpha particles.
A group of memory chips or dies in a memory device (e.g., dynamic random-access memory or DRAM), referred to as a rank, is positioned with the chips adjacent one another on a layer of the memory device. In some cases, a single memory error may be identified and corrected by error correction code in the memory system, while multiple errors or failures at a selected point in time may not be identified and corrected, as some error correction schemes cannot reliably detect more than two errors at a time. Accordingly, in some cases when one or more chips of the rank fail or experience an error, the entire rank is taken offline or disabled to prevent the memory failures in that rank from adversely affecting system performance.
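The single-error versus multiple-error behavior described above can be illustrated with a minimal sketch of a single-error-correcting, double-error-detecting (SEC-DED) code. The sketch assumes an extended Hamming (8,4) code over one 4-bit symbol; the function names `encode` and `decode` and the bit layout are illustrative assumptions only and are not taken from any particular memory system.

```python
# Illustrative SEC-DED sketch: extended Hamming (8,4) code.
# Assumption: index 0 holds an overall parity bit; indices 1..7 hold a
# Hamming(7,4) codeword with parity bits at positions 1, 2 and 4.

def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1 bits even
    return code

def decode(code):
    """Return (status, nibble); nibble is None when uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    overall = sum(code) % 2
    word = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treated as a single flipped bit at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble

if __name__ == "__main__":
    word = encode(0b1011)
    one_error = list(word)
    one_error[5] ^= 1
    two_errors = list(one_error)
    two_errors[6] ^= 1
    print(decode(word))
    print(decode(one_error))
    print(decode(two_errors))
```

In this sketch a single flipped bit is located and repaired, while two flipped bits are only flagged as uncorrectable. Three or more flipped bits may be miscorrected or go undetected by a code of this kind, which is one reason a rank with a failing chip may be disabled rather than left in service.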