This invention relates generally to computer memory, and more particularly to providing a high fault tolerant memory system.
Computer systems often require a considerable amount of high speed RAM (random access memory) to hold information such as operating system software, programs and other data while a computer is powered on and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of 8 bits; more generally these groups are called symbols and may consist of any number of bits.
Computer RAM is often designed with pluggable subsystems, often in the form of modules, so that incremental amounts of RAM can be added to each computer, dictated by the specific memory requirements for each system and application. The acronym “DIMM” refers to the dual in-line memory module, which is perhaps the most prevalent memory module currently in use. A DIMM is a thin rectangular card comprising one or more memory devices, and may also include one or more of registers, buffers, hub devices, and/or non-volatile storage (e.g., erasable programmable read only memory or “EPROM”) as well as various passive devices (e.g., resistors and capacitors), all mounted to the card.
DIMMs are often designed with dynamic memory chips or DRAMs that need to be regularly refreshed to prevent the data stored within them from being lost. Originally, DRAM chips were asynchronous devices; contemporary chips, however, are synchronous DRAMs (SDRAMs) (e.g., single data rate or “SDR”, double data rate or “DDR”, DDR2, DDR3, etc.) whose synchronous interfaces improve performance. DDR devices are available that use pre-fetching along with other speed enhancements to improve memory bandwidth and to reduce latency. DDR3, for example, has a standard burst length of 8, where the term burst length refers to the number of DRAM transfers in which information is conveyed from or to the DRAM during a read or write. Another important parameter of a DRAM device is the number of I/O pins that it has to convey read/write data. When a DRAM device has 4 pins, it is said to be a “by 4” (or ×4) device. When it has 8 pins, it is said to be a “by 8” (or ×8) device, and so on.
Memory device densities have continued to grow as computer systems have become more powerful. Currently it is not uncommon to have the RAM content of a single computer be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. Memory errors may be “hard” (repeating) or “soft” (one-time or intermittent) failures, and they may occur as single cell, multi-bit, full chip or full DIMM failures; when they occur, all or part of the system RAM may be unusable until it is repaired. Repair turn-around times can be hours or even days, which can have a substantial impact on a business dependent on the computer systems.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow.
Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity, where the bits of a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. With even parity, for example, a data word with an even number of 1's will have a parity bit of 0 and a data word with an odd number of 1's will have a parity bit of 1, with this parity bit appended to the stored memory data. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
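As an illustration of the parity scheme just described, the following is a minimal Python sketch of even parity generation and checking. The function names (`even_parity_bit`, `store_word`, `parity_ok`) and the 8-bit example word are assumptions made for this example only; they do not describe any particular memory device.

```python
from functools import reduce
from operator import xor


def even_parity_bit(bits):
    """XOR all bits together: 0 for an even number of 1's, 1 for an odd number."""
    return reduce(xor, bits, 0)


def store_word(bits):
    """Append the generated parity bit to the data word, as it would be stored."""
    return list(bits) + [even_parity_bit(bits)]


def parity_ok(stored):
    """Regenerate parity from the data bits and compare with the stored parity bit."""
    *data, parity = stored
    return even_parity_bit(data) == parity


word = [1, 0, 1, 1, 0, 0, 1, 0]      # four 1's, so the parity bit is 0
stored = store_word(word)

single_error = list(stored)
single_error[2] ^= 1                 # one flipped bit is detected

double_error = list(single_error)
double_error[5] ^= 1                 # a second flipped bit hides the first

print(parity_ok(stored), parity_ok(single_error), parity_ok(double_error))
# prints: True False True
```

The last case shows the limitation of a single parity bit: it detects any odd number of flipped bits but cannot detect an even number, and it cannot locate the failing bit.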
More sophisticated codes allow for detection and correction of errors that can affect groups of bits rather than individual bits; Reed-Solomon codes are an example of a class of powerful and well understood codes that can be used for these types of applications.
These error detection and error correction techniques are commonly used to restore data to its original/correct form in noisy communication transmission media or for storage media where there is a finite probability of data errors due to the physical characteristics of the device. The memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high energy cosmic rays and alpha particles.
In the 1980's, RAM memory device sizes first reached the point where they became sensitive to alpha particle hits and cosmic rays causing memory bits to flip. These particles do not damage the device but can create memory errors. These are known as soft errors, and most often affect just a single bit. Once identified, the bit failure can be corrected by simply rewriting the memory location. The frequency of soft errors has grown to the point that it has a noticeable impact on overall system reliability.
Memory Error Correction Codes (ECC) use a combination of parity checks in various bit positions of the data word to allow detection and correction of errors. Every time data words are written into memory, these parity checks need to be generated and stored with the data. Upon retrieval of the data, a decoder can use the parity bits thus generated together with the data message in order to determine whether there was an error and to proceed with error correction if feasible.
The first ECCs were applied to RAM in computer systems in an effort to increase fault-tolerance beyond that allowed by previous means. Binary ECC codes were deployed that allowed for double-bit error detection (DED) and single-bit error correction (SEC). This SEC/DED ECC also allows for transparent recovery of single bit hard errors in RAM.
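To make the SEC/DED behavior concrete, the following Python sketch uses an extended Hamming (8,4) code, which is one textbook example of a binary SEC/DED code. The choice of this particular code, the 4-bit data word, and the function names are assumptions for illustration; production memory ECCs operate on much wider words.

```python
def encode(data):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions 4, 5, 6, 7
    for pos in range(1, 8):
        c[0] ^= c[pos]               # overall parity over the whole word
    return c


def decode(codeword):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(c):
        overall ^= bit
        if bit and pos:
            syndrome ^= pos          # XOR of the positions holding a 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1             # single error; syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        return None, "uncorrectable" # nonzero syndrome with even overall parity: double error
    return [c[3], c[5], c[6], c[7]], status


word = encode([1, 0, 1, 1])
one_error = list(word)
one_error[6] ^= 1
two_errors = list(one_error)
two_errors[1] ^= 1

print(decode(word))        # ([1, 0, 1, 1], 'ok')
print(decode(one_error))   # ([1, 0, 1, 1], 'corrected')
print(decode(two_errors))  # (None, 'uncorrectable')
```

The extra overall parity bit is what separates the two cases: a single error leaves the overall parity odd, while a double error leaves it even with a nonzero syndrome.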
Scrubbing routines were also developed to help reduce memory errors by locating soft errors through a scanning of the memory whereby memory was read, corrected if necessary and then written back to memory.
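The read, correct and write-back loop of a scrubbing routine can be sketched as follows in Python. To keep the example short, a toy code that stores three copies of each value and corrects by bitwise majority vote stands in for a real memory ECC; that stand-in code, the list-based memory model and all names are assumptions for illustration only.

```python
def encode(value):
    """Toy stand-in for an ECC: keep three copies of the value."""
    return [value, value, value]


def decode(copies):
    """Bitwise majority vote across the three copies."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)


def scrub(memory):
    """Read each location, correct it if necessary, and write it back.

    Returns the number of locations that were repaired.
    """
    repaired = 0
    for addr, stored in enumerate(memory):
        clean = encode(decode(stored))
        if clean != stored:
            memory[addr] = clean     # rewriting the location clears a soft error
            repaired += 1
    return repaired


memory = [encode(v) for v in (0x12, 0x34, 0x56)]
memory[1][2] ^= 0x08                 # simulate a soft error in one copy

first_pass = scrub(memory)
second_pass = scrub(memory)
print(first_pass, second_pass)       # prints: 1 0
```

Running the scrub periodically limits how long a soft error stays in memory, which reduces the chance that a second error accumulates in the same word before the first is repaired.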
Some storage manufacturers have used advanced ECC techniques, such as Reed-Solomon codes, to correct for full memory chip failures. Some memory system designs also have standard reserve memory chips (e.g. “spare” chips) that can be automatically introduced in a memory system to replace a faulty chip. These advancements have greatly improved RAM reliability, but as memory size continues to grow and customers' reliability expectations increase, further enhancements are needed.
FIG. 1 depicts a contemporary prior art system composed of an integrated processor chip 100, which contains one or more processor elements and an integrated memory controller 110. In the configuration depicted in FIG. 1, multiple independent cascade interconnected memory interface busses 106 are logically aggregated together to operate in unison to support a single independent access request at a higher bandwidth with data and error detection/correction information distributed or “striped” across the parallel busses and associated devices.
The memory controller 110 attaches to four narrow/high speed point-to-point memory busses 106, with each bus 106 connecting one of the several unique memory controller interface channels to a cascade interconnect memory subsystem 103 (or memory module, e.g., a DIMM) which includes at least a hub device 104 and one or more memory devices 109. Some systems further enable operations when a subset of the memory busses 106 are populated with memory subsystems 103. In this case, the one or more populated memory busses 108 may operate in unison to support a single access request.
FIG. 2 depicts a prior art memory structure with cascaded memory modules 103 and unidirectional busses 106. One of the functions provided by the hub devices 104 in the memory modules 103 in the cascade structure is a re-drive function to send signals on the unidirectional busses 106 to other memory modules 103 or to the memory controller 110.
FIG. 2 includes the memory controller 110 and four memory modules 103, on each of two memory busses 106 (a downstream memory bus with 24 wires and an upstream memory bus with 25 wires), connected to the memory controller 110 in either a direct or cascaded manner. The memory module 103 next to the memory controller 110 is connected to the memory controller 110 in a direct manner. The other memory modules 103 are connected to the memory controller 110 in a cascaded manner. Although not shown in this figure, the memory controller 110 may be integrated in the processor 100 and may connect to more than one memory bus 106 as depicted in FIG. 1.
The connection between a hub in a DIMM and a memory controller may have transmission errors and therefore such a connection may be protected using error detection codes. In these types of designs, the memory controller checks a detection code during a read and if there is a mismatch, it issues a retry request for the faulty read (and possibly other read requests that happened in the near time vicinity). To support such retry mechanisms, the memory controller maintains a queue of pending requests which is used to determine which requests must be retried.
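A detect-and-retry mechanism of this kind can be sketched in Python as follows. The `Hub` and `MemoryController` classes, the use of CRC-32 as the detection code, the tag-based pending queue and the retry limit are all hypothetical choices made for this example; the text above does not specify which detection code or queue organization a given design uses.

```python
import zlib
from collections import OrderedDict


class Hub:
    """Hypothetical hub device that returns (data, crc) and can corrupt chosen transfers."""

    def __init__(self, memory, corrupt_transfers=()):
        self.memory = memory
        self.corrupt_transfers = set(corrupt_transfers)
        self.transfers = 0

    def read(self, addr):
        data = self.memory[addr]
        crc = zlib.crc32(data)                  # detection code computed before the bus
        self.transfers += 1
        if self.transfers in self.corrupt_transfers:
            data = bytes([data[0] ^ 0x01]) + data[1:]   # simulated transmission error
        return data, crc


class MemoryController:
    """Hypothetical controller that tracks pending reads so that they can be retried."""

    def __init__(self, hub, max_retries=3):
        self.hub = hub
        self.max_retries = max_retries
        self.pending = OrderedDict()            # tag -> address, in issue order
        self.next_tag = 0

    def issue_read(self, addr):
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = addr
        return tag

    def complete_read(self, tag):
        addr = self.pending[tag]                # the queue tells us what to re-issue
        for attempt in range(self.max_retries + 1):
            data, crc = self.hub.read(addr)
            if zlib.crc32(data) == crc:
                del self.pending[tag]
                return data, attempt
        raise IOError("read at address %#x failed after retries" % addr)


hub = Hub({0x40: b"cache line payload"}, corrupt_transfers={1})
controller = MemoryController(hub)
tag = controller.issue_read(0x40)
data, retries = controller.complete_read(tag)
print(data, retries)                            # prints: b'cache line payload' 1
```

In this sketch only the faulty read is retried; a design that also replays neighboring requests would walk the pending queue in issue order instead of looking up a single tag.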
The evolution of the minimal burst length parameter of DRAM devices has been such that it makes it increasingly difficult to provide for desirable error correction properties such as multiple chipkill support. The trend has been for the minimal burst length to increase as new DRAM technologies are introduced.
As an illustrative example, assume that a processor has a cache line of 128B, and that ancillary information totaling 4 additional bytes needs to be stored and protected together with the cache line. Such ancillary information will vary from processor design to processor design. Again for illustrative purposes, suppose the additional information is comprised of a flag indicating whether the data was corrupted even before reaching memory (the SUE flag), tag bits that can be used in data structures and a node bit that indicates whether a more recent copy of the cache line may exist elsewhere in the system.
In the DDR3 generation of DRAM devices, the minimal burst length on each device is equal to 8 transfers. Therefore a ×4 DRAM device (which by definition has 4 I/O pins) delivers/accepts a minimum of 32 bits (4 bytes) on each read/write access. Correspondingly, a ×8 DRAM device delivers/accepts a minimum of 64 bits (8 bytes) on each read/write access. Assuming a processor cache line of size 128 bytes, and assuming that for every 8 data chips there is an additional 9th chip that provides additional storage for error correction/detection codes, a simple calculation demonstrates that a total of 36 ×4 devices can be accessed in parallel to supply a total of 144 bytes (out of which 128 bytes are for data, and 4 bytes are for ancillary information). Similarly, a total of 18 ×8 devices can be accessed in parallel to supply a total of 144 bytes.
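The calculation referred to above can be reproduced with a few lines of Python. The function and constant names are assumptions for this example; the burst length of 8, the 128-byte cache line and the one-check-chip-per-eight-data-chips ratio are taken from the preceding paragraph.

```python
BURST_LENGTH = 8          # DDR3 minimal burst length, in transfers
CACHE_LINE_BYTES = 128


def bytes_per_access(io_pins, burst_length=BURST_LENGTH):
    """Bytes one device delivers per access: pins x transfers, converted from bits."""
    return io_pins * burst_length // 8


def devices_in_parallel(io_pins, data_bytes=CACHE_LINE_BYTES):
    """Data devices for one cache line plus one extra check device per 8 data devices."""
    data_devices = data_bytes // bytes_per_access(io_pins)
    check_devices = data_devices // 8
    return data_devices + check_devices


for pins in (4, 8):
    count = devices_in_parallel(pins)
    total = count * bytes_per_access(pins)
    print(f"x{pins}: {bytes_per_access(pins)} bytes/device, {count} devices, {total} bytes")
# prints:
# x4: 4 bytes/device, 36 devices, 144 bytes
# x8: 8 bytes/device, 18 devices, 144 bytes
```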
As we stated earlier, it is highly desirable for an error correction code to provide for the ability to survive a chipkill. Unfortunately, those skilled in the art will recognize that while it is possible to allow for chipkill recovery in the setting where 2 of the 18 chips are completely devoted to redundant checks, once the additional ancillary information is introduced as a storage requirement it becomes mathematically impossible to allow for the recovery of chipkills with 100% certainty.
One alternative is to construct a memory using ×4 parts instead, since in this memory geometry a total of 32 devices may be devoted to data and the 33rd device may be devoted to the ancillary information, which would leave 3 additional chips for redundant information. Such redundancy, as those skilled in the art will recognize, allows the system to have single chip error correct/double chip error detect capabilities.
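The redundancy budgets of the two geometries can be compared with the short Python sketch below, which expresses the storage left over after data and ancillary information in units of whole chips. The function name and the use of exact fractions are assumptions for this example; the device counts and byte sizes come from the illustrative example above.

```python
from fractions import Fraction


def redundancy_in_chips(io_pins, total_devices, data_bytes=128,
                        ancillary_bytes=4, burst_length=8):
    """Storage left for redundant checks, measured in chips' worth of bytes."""
    per_device = io_pins * burst_length // 8
    spare_bytes = total_devices * per_device - data_bytes - ancillary_bytes
    return Fraction(spare_bytes, per_device)


print(redundancy_in_chips(4, 36))   # prints: 3
print(redundancy_in_chips(8, 18))   # prints: 3/2
```

In both geometries 12 bytes remain for redundant checks, but with ×4 parts this is three full chips' worth, whereas with ×8 parts it is only one and a half, less than the 2 full chips available when no ancillary information is stored.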
A strong reason for not using ×4 parts nonetheless is related to power consumption. Assume that ×4 and ×8 parts have identical storage capacity. Contrasting two systems with exactly the same number of chips, but one with ×4 chips and the other one with ×8 chips, the same amount of “standby” power is incurred in both (standby power is the amount of power paid in the absence of any memory activity).
Nonetheless, every time an access is made to memory, in the ×4 memory configuration a total of 36 devices are activated simultaneously, as opposed to the ×8 situation where only 18 devices are activated simultaneously. Therefore, the “active” power (paid during memory accesses) in the ×4 setting is double that in the ×8 setting.