1. Technical Field
The present invention relates to a system and method for adjusting command streams to memory in response to bit errors in the memory. More particularly, the present invention relates to a system and method for adjusting refresh rates and usage delays in response to bit errors exceeding a given threshold.
2. Description of the Related Art
Memory errors in computer systems can result from a variety of factors, such as heat and other environmental conditions. Poor ventilation of a computer system, caused by the system's physical location or by dust blocking air ducts, can increase heat in the system and may result in memory errors. In addition, power supplies can cause power fluctuations that lead to memory errors.
Memory is an electronic storage device, and like most electronic storage devices, it has the potential to return information different from what was originally stored. Dynamic Random Access Memory (DRAM), because of its nature, is likely to return occasional memory errors. DRAM stores ones and zeros as charges on small capacitors that must be continually refreshed to ensure that the data is not lost. Generally, DRAM is less reliable than the static storage used by Static Random Access Memory (SRAM). While SRAMs are generally more reliable and faster than DRAMs, SRAMs are also more expensive and are therefore often used for cache memory, with main memory often using DRAMs. A “Rank” defines a set of DRAM chips (on a module) generally comprising 8-byte-wide (64-bit) data, or 9 bytes (72 bits) with ECC. Devices in a Rank are connected by a single Chip-Select. The actual memory size in a Rank is not defined. Single-sided memory modules are generally Single-Rank, while double-sided unbuffered DIMMs and SODIMMs are generally Dual-Rank. Server DIMMs may have up to 4 ranks.
One kind of memory error that occurs in computer systems is called a transient or soft error. This occurs when a bit reads back the wrong value once, but subsequently functions correctly. Soft errors are sometimes caused by memory that is physically bad, but at least as often they are the result of poor quality motherboards, memory system timings that are set too fast, static shocks, or environmental factors such as heat and power fluctuations.
Errors in the memory subsystem are a concern at the system level as they can result in a system crash or in incorrect data that affects the outcome of the applications running on the system. Parity checking provides single-bit error detection for the system memory, but generally does not handle multi-bit errors and does not provide a way to correct memory errors. An advanced error detection and correction protocol was invented to go a step beyond simple parity checking. Called ECC, which stands for error correcting circuits, error correcting code, or error correction code, this protocol not only detects both single-bit and multi-bit errors but also transparently corrects errors on the fly, within the limits of the particular code.
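The limitation of parity described above can be illustrated with a short Python sketch. It assumes even parity over a single byte; the function names `parity_bit` and `parity_ok` and the sample values are hypothetical and chosen only for illustration.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 when the count of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Return True when the recomputed parity matches the stored parity bit."""
    return parity_bit(byte) == stored_parity


stored = 0b10110010                  # value originally written to memory
p = parity_bit(stored)               # parity bit stored alongside it

single_flip = stored ^ 0b00000100    # one bit flipped
double_flip = stored ^ 0b00010100    # two bits flipped

print(parity_ok(stored, p))          # unchanged data passes the check
print(parity_ok(single_flip, p))     # a single-bit error is detected
print(parity_ok(double_flip, p))     # a double-bit error passes unnoticed
```

Because the parity bit only records whether the number of set bits is odd or even, any even number of flipped bits leaves the check satisfied, and in no case does the check indicate which bit is wrong.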
ECC uses an algorithm to encode information in a block of bits that contains sufficient detail to permit the recovery of a memory error in the protected data. Depending on the complexity of the ECC algorithm, a single bit error, multiple single bit errors, or a packet error (a block of consecutive bits) can be corrected. Unlike parity, which uses a single bit to provide protection to eight bits, ECC uses larger groupings: for example, 7 bits to protect 32 bits, or 8 bits to protect 64 bits. Industry standard memory modules come in two form factors: a non-ECC module with 64 bits of data and an ECC module with 64 bits of data and 8 bits for the ECC code. These ECC modules can be used for memory subsystems that implement either a parity protection scheme or an ECC scheme. For systems that require a higher level of reliability, custom memory modules with additional ECC bits can be defined to allow for more correction and detection capability.
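The check-bit counts quoted above (7 bits for 32, 8 bits for 64) are consistent with a single-error-correcting, double-error-detecting Hamming code. The text does not name a specific code, so the following Python sketch is an assumption: it applies the standard Hamming bound and adds one overall parity bit. The function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by an extended Hamming (SEC-DED) code.

    Finds the smallest r with 2**r >= data_bits + r + 1 (enough syndromes to
    name every bit position plus "no error"), then adds one overall parity
    bit for double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


for width in (8, 32, 64):
    check = secded_check_bits(width)
    print(f"{width} data bits -> {check} check bits ({check / width:.1%} overhead)")
```

Under this assumption, a 64-bit word needs 8 check bits, the same number of extra bits that byte-wise parity would use for 8 bytes, which is why a module with 64 data bits and 8 additional bits can serve either scheme.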
ECC has the ability to correct a detected error in a block of memory. When this happens, the computer will continue without incident and the user will not be aware that a memory error occurred. Although the addition of ECC to a memory subsystem has improved memory reliability, ECC has a limited correction capability. Error Correction Codes can be designed to correct anything from a single bit error in a block of memory to large numbers of random bit errors in a block of memory. At the system level, the trade-off is correction capability versus the cost of implementing the code in logic complexity and overhead of DRAM bits. In general, the ECC for a system is designed to correct a subset of the possible memory errors. This subset is based on analysis of failure mechanisms in the system when the system is running at nominal conditions. For example, analysis of a system design may indicate that, under nominal conditions, a very high percentage of the memory errors that occur would be single bit failures. For this example system, the ECC may be designed to correct single bit errors and detect any two single bit errors. If an error occurs that exceeds the correction capability of the error correction code, then the data from memory is unusable and the application being executed fails.
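The example design point above (correct single bit errors, detect any two) can be sketched in Python with a toy extended Hamming code over 4 data bits. This is an illustrative assumption, not the code used by any particular memory subsystem; the names `encode` and `decode` and the bit layout are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1..7
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    word = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 for a valid word
    overall = sum(word) % 2                 # 0 when total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[syndrome] ^= 1                 # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None        # two flips: detected but not correctable
    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, data


word = encode(0b1011)
one_error = list(word)
one_error[5] ^= 1
two_errors = list(one_error)
two_errors[2] ^= 1

print(decode(word))        # clean read
print(decode(one_error))   # single error is corrected transparently
print(decode(two_errors))  # double error is only detected
```

The last case corresponds to the failure described above: the error is larger than the correction capability, so the decoder can report it but cannot return usable data.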
The memory subsystem design point generally assumes that a single error event might occur in a block of memory. This one error event is correctable by way of the ECC logic. This assumption holds true as long as the block of memory is read and corrected before a second error event occurs. Under normal conditions, this is generally not a problem as error events usually occur far apart in time. One challenge faced by the memory subsystem is that, under extreme conditions, the rate at which fail events occur may increase to the point that multiple fail events occur between accesses to the block of memory. This would result in a block of memory with unusable data and a potential for an application or system crash. This condition of increased fail rates can be the result of environmental factors (such as excessive heat), system level noise due to high activity levels, power line disturbances, and many other factors. These conditions are difficult and expensive to detect with normal means and, unless detected in time, result in a non-recoverable system error. Moreover, once detected, there is currently no way to dynamically alter the correction capability of the memory subsystem to avoid the failure.
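The dependence of this risk on the fail rate and on the time between accesses can be illustrated with a rough Python sketch. It assumes that error events in a block arrive independently at a constant rate (a Poisson model), which is a modeling assumption not stated in the text; the function name `p_multiple_errors` and the sample rates and intervals are hypothetical.

```python
import math


def p_multiple_errors(rate_per_hour: float, interval_hours: float) -> float:
    """Probability of two or more error events in one block between accesses.

    Assumes Poisson arrivals: P(N >= 2) = 1 - exp(-m) * (1 + m), where
    m = rate * interval is the expected number of events in the interval.
    """
    mean = rate_per_hour * interval_hours
    return 1.0 - math.exp(-mean) * (1.0 + mean)


nominal_rate = 1e-6    # hypothetical events per hour under nominal conditions
stressed_rate = 1e-2   # hypothetical events per hour under extreme conditions

for label, rate in (("nominal", nominal_rate), ("stressed", stressed_rate)):
    for interval in (1.0, 24.0):
        prob = p_multiple_errors(rate, interval)
        print(f"{label:8s} rate, {interval:4.0f} h between accesses: {prob:.3e}")
```

For small expected counts the probability grows roughly with the square of rate times interval, so under this model a higher fail rate or a longer gap between accesses sharply raises the chance that a second error lands before the first is corrected.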