The present invention relates generally to computer memory, and more particularly, to bank-level fault management in a memory system.
Computer systems often require a considerable amount of high-speed random access memory (RAM) to hold information, such as data and programs, temporarily while a computer is powered and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of 8 bits; more generally, these groups or bytes are called symbols and may consist of any number of bits or sub-symbols.
Memory device densities have continued to grow as computer systems have become more powerful. It is currently not uncommon for the RAM content of a single computer to comprise hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. Memory errors may be “hard” (repeating) or “soft” (one-time or intermittent) failures, and may occur as single-cell, multi-bit, full-chip, or full-memory-module failures. When such errors occur, all or part of the system RAM may be unusable until it is repaired. Repair turn-around times can be hours or even days, which can have a substantial impact on a business dependent on the computer systems.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow. Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. These error detection and error correction techniques are commonly used to restore data to its original, correct form in noisy communication transmission media or in storage media where there is a finite probability of data errors due to the physical characteristics of the device. Memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high-energy cosmic rays and alpha particles.
Error-correcting codes (ECCs) are used in more robust systems and are typically collectively stored in an additional device to detect and correct specific error conditions. Memory devices (e.g., dynamic random access memory or DRAM devices) are often grouped as ranks on a module, such as a dual inline memory module (DIMM). Each rank includes multiple DRAMs, and each DRAM can internally include multiple banks. ECC decoding to detect and correct bit errors is typically supported at a DRAM per-rank granularity. In some cases, a single bit error may be identified and corrected by a code in the memory system. ECC decoders may also support error detection and correction of more than one bit. In some cases, multiple errors or failures at a selected point in time may not be identified and corrected, as error correction systems are typically unable to detect and/or correct more than a certain number of bits at a time. Accordingly, in some cases when one or more chips of a rank fail or experience an error, the entire rank is taken offline or disabled to avoid memory failures in that rank. This creates a hole in the available memory space and can therefore adversely affect system performance.
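By way of illustration only, the limit on correctable bits can be sketched with a textbook Hamming(7,4) code, which corrects any single bit error in a 7-bit word but miscorrects when two bits are in error. This code is an assumption chosen for brevity; it is not the ECC of any particular memory system, which would typically use wider, often symbol-based, codes. The function names `encode` and `decode` are hypothetical.

```python
# Illustrative Hamming(7,4) sketch: 4 data bits protected by 3 check bits.
# Assumption: this toy code stands in for a memory ECC; it is not the
# code used by any specific memory system.

def encode(data):
    """Return the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(word):
    """Return (data_bits, syndrome); a nonzero syndrome is the 1-based
    position that the decoder assumes is in error and flips."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        w[syndrome - 1] ^= 1  # correct the single bit the syndrome names
    return [w[2], w[4], w[5], w[6]], syndrome

data = [1, 0, 1, 1]
codeword = encode(data)

# One flipped bit (position 5): the decoder locates and corrects it.
single = list(codeword)
single[4] ^= 1
fixed, syndrome_single = decode(single)

# Two flipped bits (positions 3 and 5): the decoder flips a third,
# wrong bit and returns corrupted data.
double = list(codeword)
double[2] ^= 1
double[4] ^= 1
wrong, syndrome_double = decode(double)
```

In this sketch the single-bit case recovers the original data, while the two-bit case exceeds the code's correction capability and yields incorrect data without any indication of failure, which is the kind of condition that can lead to an entire rank being taken offline.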