1. Field of the Invention
This invention relates to computer system reliability and, more particularly, to the detection of errors in memory subsystems.
2. Description of the Related Art
Computer systems are typically available in a range of configurations which may afford a user varying degrees of reliability, availability and serviceability (RAS). In some systems, reliability may be paramount. Thus, a reliable system may include features designed to prevent failures. In other systems, availability may be important and so systems may be designed to have significant fail-over capabilities in the event of a failure. Either of these types of systems may include built-in redundancies of critical components. In addition, systems may be designed with serviceability in mind. Such systems may allow fast recovery from system failures because components are readily accessible. In critical systems, such as high-end servers and some multiple processor and distributed processing systems, a combination of the above features may produce the desired RAS level.
Depending on the type of system, data that is stored in system memory may be protected from corruption in one or more ways. One such way to protect data is to use error detection and/or error correction codes (ECC). The data may be transferred to system memory with an associated ECC code which may have been generated by a sending device. ECC logic may then regenerate the ECC code from the received data and compare it with the transferred code prior to storing the data in system memory. When the data is read out of memory, the ECC code may again be regenerated and compared with the stored code to ensure that no errors have been introduced to the stored data.
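By way of illustration only, the generate, regenerate and compare flow described above may be sketched as follows. The sketch is not taken from any particular system. The names generate_ecc and EccMemory, the 64-bit word width and the toy check code (an XOR of the one-based positions of the set data bits) are all assumptions chosen for brevity. The toy code detects a single flipped data bit but corrects nothing, whereas a practical ECC would typically also provide correction.

```python
def generate_ecc(data: int, width: int = 64) -> int:
    """Toy check code (assumption, not a real ECC): XOR of the one-based
    positions of every set bit in the low `width` bits of `data`."""
    code = 0
    for position in range(width):
        if (data >> position) & 1:
            code ^= position + 1
    return code


class EccMemory:
    """Hypothetical memory model that stores each word with its check code."""

    def __init__(self) -> None:
        self.cells = {}  # address -> (data, check code)

    def write(self, address: int, data: int, sender_code: int) -> None:
        # Regenerate the code from the received data and compare it with
        # the code supplied by the sending device before storing.
        if generate_ecc(data) != sender_code:
            raise ValueError("ECC mismatch on write")
        self.cells[address] = (data, sender_code)

    def read(self, address: int) -> int:
        # Regenerate the code again on the way out and compare it with
        # the code that was stored alongside the data.
        data, stored_code = self.cells[address]
        if generate_ecc(data) != stored_code:
            raise ValueError("ECC mismatch on read")
        return data
```

In this sketch, a sending device would call generate_ecc on a word and pass both the word and the resulting code to write. A later read either returns the word or raises an error if a stored data bit has flipped in the meantime.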
In addition, some systems may employ ECC codes to protect data that is routed throughout the system. However, in systems where a system memory module such as, for example, a dual in-line memory module (DIMM) is coupled to a memory controller, the data bus and corresponding data may be protected as described above, but the address, command and control information and corresponding wires may not. In such systems, a bad bit or wire which conveys erroneous address or command information may go undetected. For example, correct data may be stored to an incorrect address, or data may not actually be written to a given location. When the data is read out of memory, the ECC codes for that data may not detect this type of error, since the data itself may be good. When a processor tries to use the data, however, the results may be unpredictable or catastrophic.
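The failure mode described above may be illustrated with a minimal sketch in which only the data is covered by a check code. Everything in the sketch is hypothetical: the function names, the single parity bit standing in for a real ECC, and the stuck_low_bit parameter that models one address wire stuck at zero.

```python
def check_code(data: int) -> int:
    """Single even-parity bit over the data (a stand-in for a real ECC)."""
    return bin(data).count("1") & 1


def write(memory: dict, address: int, data: int, stuck_low_bit=None) -> None:
    # Hypothetical fault model: one address wire is stuck at 0, so the
    # module sees a different address than the controller intended.
    if stuck_low_bit is not None:
        address &= ~(1 << stuck_low_bit)
    # The check code covers the data only, never the address.
    memory[address] = (data, check_code(data))


def read(memory: dict, address: int):
    data, stored_code = memory[address]
    # Data-only comparison: passes whenever the stored word is intact.
    return data, check_code(data) == stored_code


memory = {}
write(memory, 0x10, 0xAAAA)
write(memory, 0x14, 0x5555)

# An update intended for 0x14 is sent while address bit 2 is stuck at 0,
# so it lands at 0x10 instead.
write(memory, 0x14, 0x1234, stuck_low_bit=2)

stale = read(memory, 0x14)      # old value is returned, check still passes
clobbered = read(memory, 0x10)  # overwritten value, check still passes
```

Both reads report a passing check even though location 0x14 holds stale data and location 0x10 has been overwritten, because the check code says nothing about where the data was stored.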