1. Field of the Invention
The present invention relates generally to memory controllers and, more particularly, to the long-term storage of detected memory errors.
2. Description of the Related Art
A memory controller makes a dynamic memory system appear static to the host processor. The controller refreshes the memory chips, multiplexes the row and column addresses, generates control signals, determines the pre-charge period, and signals the processor when data is available or no longer needed. Memory controllers also coordinate memory sharing among multiple processors and often assist in detecting and correcting memory errors.
Ensuring data integrity is a major concern in large dynamic random access memory (DRAM) systems, particularly because of their susceptibility to soft errors caused by alpha-particle radiation. Various parity encoding techniques have been developed to detect and correct memory errors. The parity bits, often called check bits when used for error correction as well as detection, are stored in the dynamic memory array along with the associated data bits. When the data is read, the check bits are regenerated from the retrieved data bits and compared with the stored check bits. If a single-bit error exists, whether in the retrieved check bits or in the retrieved data bits, the result of the comparison, typically called the syndrome, gives the location, within the group of data and check bits, of the bit in error.
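By way of illustration only, the regenerate-and-compare step described above may be sketched with a Hamming(7,4) single-error-correcting code. The choice of code, the 4-bit data width, and the function names `check_bits` and `syndrome` are assumptions made for this sketch; the text above does not name a particular code, and practical memory systems generally protect much wider words.

```python
# Illustrative sketch only: Hamming(7,4) is assumed here, not taken from the text.
# Codeword positions 1..7 are laid out as [p1, p2, d1, p4, d2, d3, d4].

def check_bits(data):
    """Generate the three check bits for four data bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return (p1, p2, p4)

def syndrome(data_read, check_read):
    """Regenerate the check bits from the data as read and compare them
    with the check bits as read. Returns 0 when they agree; otherwise
    the 1-based codeword position of the single bit in error."""
    regenerated = check_bits(data_read)
    s1, s2, s4 = (r ^ c for r, c in zip(regenerated, check_read))
    return s1 | (s2 << 1) | (s4 << 2)

data = [1, 0, 1, 1]
stored = check_bits(data)            # written to memory with the data

print(syndrome(data, stored))        # 0: no error

flipped = [1, 1, 1, 1]               # d2 (codeword position 5) flipped
print(syndrome(flipped, stored))     # 5: error in a data bit

bad_check = (0, 0, 0)                # p2 (codeword position 2) flipped
print(syndrome(data, bad_check))     # 2: error in a check bit
```

A nonzero syndrome identifies the bit to invert, whether that bit is a data bit or a check bit, which is why the syndrome is among the items worth preserving when an error is logged.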
The first step in rectifying such errors is to identify the error that occurred, as well as various signals present in the computing system at the time of the error. In some computing systems, these signals are generated by various circuit components and stored in one or more control and status registers, typically called "CSRs". For example, a typical CSR might be provided with information regarding some of the following items: an indication of what type of error occurred, the memory address that was being written to or read from when the error occurred, a number of check bits associated with the data that was being written to or read from memory when the error occurred, and the syndromes associated with the data that was being written or read at the time of the memory error.
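The items listed above may be pictured as fields packed into a single register word. The following is a minimal sketch under stated assumptions: the class name `CsrRecord`, the field order, and the field widths (2-bit error type, 32-bit address, 8 check bits, 8-bit syndrome) are hypothetical and are not drawn from any particular computing system.

```python
from dataclasses import dataclass

# Hypothetical field widths, chosen only for illustration.
TYPE_BITS, ADDR_BITS, CHECK_BITS, SYND_BITS = 2, 32, 8, 8

@dataclass(frozen=True)
class CsrRecord:
    error_type: int   # indication of what type of error occurred
    address: int      # memory address being read or written
    check_bits: int   # check bits associated with the data
    syndrome: int     # syndrome associated with the data

    def pack(self) -> int:
        """Pack the fields into one register word, error type highest."""
        word = 0
        for value, width in (
            (self.error_type, TYPE_BITS),
            (self.address, ADDR_BITS),
            (self.check_bits, CHECK_BITS),
            (self.syndrome, SYND_BITS),
        ):
            if not 0 <= value < (1 << width):
                raise ValueError("field does not fit its assumed width")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "CsrRecord":
        """Recover the fields from a register word, lowest field first."""
        fields = []
        for width in (SYND_BITS, CHECK_BITS, ADDR_BITS, TYPE_BITS):
            fields.append(word & ((1 << width) - 1))
            word >>= width
        syndrome, check_bits, address, error_type = fields
        return cls(error_type, address, check_bits, syndrome)

record = CsrRecord(error_type=1, address=0x1F40, check_bits=0x5A, syndrome=0x05)
assert CsrRecord.unpack(record.pack()) == record
```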
Although CSRs are useful in solving memory problems in many applications, there are other applications in which further improvement would be helpful. Typically, one CSR is provided for each "memory module", wherein a memory module includes a collection of cooperating memory banks. Each time a new memory error occurs, the data associated with that error is written into the CSR associated with the memory module where the error occurred. If, at the time of the new memory error, the CSR already contains data corresponding to a previous error, data pertaining to the new error cannot be stored and, at best, an error overflow bit can be set. Accordingly, a CSR at any given time is more likely to contain data associated with a frequently occurring memory error than data from an infrequent error. As a result, central processing components that utilize information obtained from CSRs are sometimes unable to detect the infrequent errors, which reduces the ability of the fault management program to process multiple faults. The overall effect is that the computing system's reliability is diminished.
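The behavior just described, in which a single CSR holds the first error and later errors can only set an overflow bit, may be sketched as follows. The class `ErrorCsr`, the helper `poll`, the event labels "A" and "B", and the servicing period are all hypothetical and serve only to illustrate how an infrequent error can go unrecorded.

```python
class ErrorCsr:
    """One CSR per memory module: holds a single error record."""

    def __init__(self):
        self.record = None
        self.overflow = False

    def report(self, record):
        """Store the record if the CSR is empty; otherwise the new
        error is lost and only the overflow bit is set."""
        if self.record is None:
            self.record = record
            return True
        self.overflow = True
        return False

    def service(self):
        """Read out and clear the CSR, as servicing software would."""
        record, self.record, self.overflow = self.record, None, False
        return record

def poll(events, period):
    """Report each event to the CSR and service it every `period` events.
    Returns the records actually seen and the count of overflowed reads."""
    csr = ErrorCsr()
    seen, overflows = [], 0
    for i, event in enumerate(events, start=1):
        csr.report(event)
        if i % period == 0:
            overflows += int(csr.overflow)
            seen.append(csr.service())
    return seen, overflows

# "A" is a frequently recurring error; "B" occurs once.
seen, overflows = poll(["A", "B", "A", "A", "A", "A"], period=3)
print(seen, overflows)   # ['A', 'A'] 2
```

In this run the infrequent error "B" never reaches the servicing software: the overflow bit indicates that something was lost, but not what.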
The correctable data memory errors described above are only one of three possible classes of memory error. The other two classes are uncorrectable data memory errors and memory controller errors. On occasion, the type or class of error will require the replacement of the entire main memory, a particular memory bank, and/or the memory controller. For example, uncorrectable memory errors or memory controller errors may require the replacement of the memory module. Module replacement might also be required for frequently recurring correctable memory errors. These modules are typically returned to the manufacturer or to a repair facility where highly trained technicians or engineers test the memory to determine where and why the errors occurred.
Error logging features assist the technicians and facilitate the determination of the cause of the errors. A typical error logging feature may require tagging single bit errors and uncorrectable errors during memory read transmission from a memory subsystem. The memory controller may also save syndrome bits for the first memory read error and the error address for error logging and servicing. The memory controller will retain this information until the first error is serviced by the operating system. The memory controller may also contain one or more CSRs that are used for diagnostic purposes when the technician performs simulated memory reads in an attempt to reproduce the error. However, most errors are caused by transient faults. Thus, many errors are simply not reproducible.
Computer manufacturers spend many millions of dollars each year on memory module repair. Not uncommonly, the highly trained repair technicians fail to reproduce errors in a large percentage of the memory modules returned to repair centers throughout the world. Clearly, if a memory module fails in service, and this failure cannot be duplicated in a laboratory environment, designers cannot make effective modifications to the memories to avoid future failures.
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.