Controllers for DRAMs (dynamic random-access memories) have grown more complex over time, both because the data rates to memory have increased and because the features built into the memory parts have become more elaborate. For example, multiple memory banks within the memory parts (chips) add significantly to the design complexity of a controller that attempts to use that capability to better advantage.
Over time and as a result of multiple causes, computer memories will have data errors. Only purchasers of inexpensive PCs tolerate the inconvenience of memories that do not have ECC (error-correction code) circuitry. One common ECC type is an SECDED (single-error correct, double-error detect) ECC. There are numerous different well-known codes that can be used to achieve such a function.
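As an illustration of the SECDED function described above, the following minimal sketch uses an extended Hamming (8,4) code, which is one of the well-known codes of this type. The 4-bit data width, the bit layout, and the names `encode` and `decode` are assumptions chosen to keep the example small; practical memory systems apply the same principle to much wider words.

```python
# Sketch of SECDED with an extended Hamming (8,4) code (illustrative only).
# Bit layout (an assumption for this example):
#   index 0      = overall parity bit
#   index 1,2,4  = Hamming check bits
#   index 3,5,6,7 = the four data bits

def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword, returned as a list of bits."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Each check bit makes the parity of its covered positions even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over the whole word
    return code

def decode(code):
    """Return (data, status); data is None when the error is uncorrectable."""
    code = list(code)  # work on a copy
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd overall parity means a single-bit error; the syndrome is its
        # position (0 means the overall parity bit itself was flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits are wrong.
        return None, "uncorrectable"

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status

if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                  # inject a single-bit soft error
    print(decode(word))           # single error is corrected
    word[6] ^= 1                  # inject a second error
    print(decode(word))           # double error is detected, not corrected
```

The overall parity bit is what separates the two cases: a single error always leaves the overall parity odd, while a double error leaves it even with a nonzero syndrome.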
As the density of memory chips keeps increasing, the individual memory bits become more sensitive to upset and therefore to data loss. Data failures that do not result in (or result from) permanent IC failures, such that the memory part still functions correctly, are called soft errors. These soft errors can be caused by familiar mechanisms like alpha particles but also, increasingly, by other mechanisms like other heavy ions and power-supply noise. The sensitivity to data loss increases geometrically as process rules shrink and power-supply voltages are reduced, while the total number of bits per processor also increases geometrically because of user memory-size requirements. This means that soft-error rates for systems coming on line will increase by orders of magnitude over historic error-rate norms.
In the past, soft memory errors have generally been handled by error-correction codes: SECDED and the like. Other correction technologies exist, and are sometimes used, but become increasingly expensive as a fraction of total memory cost as the correction and recovery capability is improved. For example, U.S. Pat. No. 5,745,508 “ERROR-DETECTION CODE” by Thomas Prohofsky, which is incorporated herein by reference, discusses SECDED codes that also detect certain three-bit and four-bit errors; and U.S. Pat. No. 4,319,357 “DOUBLE ERROR CORRECTION USING SINGLE ERROR CORRECTING CODE” by Bossen, which is incorporated herein by reference, discusses correcting certain hard-soft double-error combinations.
U.S. patent application Ser. No. 09/407,428 filed Sep. 29, 1999 and entitled “MULTIPROCESSOR NODE CONTROLLER CIRCUIT AND METHOD” by Deneroff et al. describes a system that can use ECC memory.
All DRAM parts need to be refreshed; that is what the D (dynamic) in the DRAM name indicates: one must cycle the memory repeatedly in order that the dynamic contents (the stored charges) of the capacitive store of each memory bit are regenerated. This “refresh” function is typically managed by having the memory parts themselves perform the refresh operation. The parts generally do so in response to a specific command that the local memory controller sends at a rate chosen so that all memory bits are referenced within the required refresh interval.
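The relationship between the required refresh interval and the rate of refresh commands can be sketched with simple arithmetic. The figures used here (a 64 ms refresh interval, 8192 refresh commands per interval, and 100 ns of busy time per command) are assumptions typical of commodity SDRAM parts and are not taken from this document; the function names are likewise hypothetical.

```python
# Sketch: how often a controller must issue refresh commands (illustrative).
# All numeric values below are assumed example figures, not part data.

def refresh_command_period_us(refresh_interval_ms, commands_per_interval):
    """Average spacing between refresh commands, in microseconds."""
    return refresh_interval_ms * 1000.0 / commands_per_interval

def refresh_overhead_fraction(command_period_us, busy_time_us):
    """Fraction of time the part is unavailable because it is refreshing."""
    return busy_time_us / command_period_us

if __name__ == "__main__":
    period = refresh_command_period_us(64.0, 8192)   # 7.8125 us
    overhead = refresh_overhead_fraction(period, 0.1)
    print(f"one refresh command every {period} us")
    print(f"refresh overhead: {overhead:.2%}")
```

Under these assumed figures the controller must send a refresh command roughly every 7.8 microseconds, which is why refresh has to be interleaved with normal traffic rather than performed as an occasional batch.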
Some features found in past controllers have recognized benefits that make them likely to be used in new designs: memory refresh, memory scrubbing, and support for spare bits in memory. Conventional uses for a spare bit include the ability to logically rewire a card that has a stuck bit (a bit that is always zero or one) or a frequently failing signal on a pin of a memory part, so that the card can be returned to correct operation without physical access to the failing pin, chip, or card. In the past, such rewiring typically required removing the card from the system. Logic circuits that provide rudimentary versions of these features with the card in place in a system have shortcomings, such as having to stop accesses from the processor (at least to the section of memory having the failed bit, and perhaps entirely) in order to reconfigure the memory card so that the spare bit replaces the failed bit.
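The logical rewiring that a spare bit makes possible can be sketched as a pair of steering functions, one on the write path and one on the read path. This is a simplified software model under stated assumptions, not a description of any particular controller: the 8-bit word width, the single spare position, and the names `steer_write` and `steer_read` are all hypothetical.

```python
# Sketch of spare-bit steering (illustrative software model).
# Assumptions: an 8-bit data word plus one spare bit stored at index 8.

DATA_BITS = 8
SPARE = DATA_BITS                 # index of the spare bit in the stored word
DATA_MASK = (1 << DATA_BITS) - 1

def steer_write(word, bad_bit):
    """Return the pattern sent to memory; the bad bit's value is duplicated
    into the spare position so it survives the failing cell or pin."""
    word &= DATA_MASK
    if bad_bit is None:
        return word
    value = (word >> bad_bit) & 1
    return word | (value << SPARE)

def steer_read(raw, bad_bit):
    """Return the data word, taking the bad position's value from the spare."""
    if bad_bit is None:
        return raw & DATA_MASK
    spare_value = (raw >> SPARE) & 1
    cleared = raw & DATA_MASK & ~(1 << bad_bit)
    return cleared | (spare_value << bad_bit)

if __name__ == "__main__":
    bad = 3                               # position known to be stuck at zero
    stored = steer_write(0b11111111, bad)
    stored &= ~(1 << bad)                 # model the stuck-at-zero fault
    print(bin(steer_read(stored, bad)))   # original data is recovered
```

In this model the failing position is simply ignored on reads, so the same steering works whether the bit is stuck at zero, stuck at one, or failing intermittently.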
Electrical issues and pin limitations push memory-system design toward placing the memory controller(s) on the memory cards. They also push the card interface toward higher data rates per pin, which reduces the number of pins while keeping the card bandwidth in line with the higher performance needs of the attached processors and with the bandwidth of the memory components on the memory cards. A memory card design that adopts this direction has test issues: the memory components (the chips) are not directly accessible for testing, as has been normal in past industry practice, and the high-speed interfaces run too fast for connection to the testers that are available in normal production testing. While special-purpose test equipment can be built and used, the design of special-purpose memory testers is very expensive and time consuming.
Thus, there is a need for improved methods and circuits in memory subsystems and for logic functions in which memory performance, reliability (the time between system failures, or the inverse of failure frequency) and availability (the percentage of time the system is up and working) are improved.