Typically, systems that make use of electronic data and commands store the data and commands in some sort of memory device. Access to the data stored in the memory device is often controlled by a memory control unit (MCU). In such a system, each individual memory device is assigned to an MCU, and each MCU controls access to its corresponding memory devices. For example, in a computer, when a microprocessor requests data from the computer's Random Access Memory (RAM), the microprocessor sends a request to the MCU corresponding to the RAM, and the MCU fetches the requested data from the RAM and sends it to the microprocessor.
The MCU also controls the writing of data to the memory devices assigned to it. For the system to run reliably, the data written to the memory device and stored in the memory device must be identical to the data received by the MCU. In order to ensure the accuracy of the data written and stored in the memory device, the MCU may use an Error Correcting Circuit (ECC) to verify the receipt, writing and storage of the incoming data.
ECCs can detect single- and multi-bit errors in a data stream. Single-bit errors are most often caused by a random transient event. ECCs can also correct most single-bit errors, allowing the system to operate without downtime attributable to correctable single-bit errors. However, certain types of single-bit errors are uncorrectable. These uncorrectable single-bit errors are sometimes referred to as “sticky bits.” A “sticky bit” is usually the result of a physical problem in the memory device and does not always change state when requested. In addition, multi-bit errors are typically uncorrectable. Multi-bit errors may be the result of a physical problem in the memory device, as with a “sticky bit,” or they may be the result of a random transient event.
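As an illustration of the distinction between correctable single-bit errors and uncorrectable multi-bit errors, the following is a minimal sketch of a single-error-correcting, double-error-detecting code: a Hamming(7,4) code extended with one overall parity bit. The code choice, the 4-bit data width, and the function names `encode` and `decode` are assumptions made for illustration only and are not taken from any particular ECC.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword, returned as a list of bits.

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    # The syndrome is the XOR of the positions of all set bits. It is zero
    # for a valid codeword and equals the flipped position after one flip.
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity = sum(code) % 2

    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        # Odd parity: treated as a single-bit error and repaired. A syndrome
        # of 0 here means the overall parity bit itself was flipped.
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: two bits differ, and the
        # code cannot tell which two, so the error is only detected.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch, flipping any one bit of a codeword yields a "corrected" result with the original data, while flipping any two bits yields "uncorrectable", which corresponds to the case in which a system would typically shut down.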
Unlike correctable single-bit errors, multi-bit errors and uncorrectable single-bit errors result in a difference between the data received and the data actually written to the memory device. Upon identifying an uncorrectable error, most systems shut down to avoid further corrupting the data or operating on corrupted data. For example, in a computer system, identification of an uncorrectable error will often initiate a controlled shutdown, including a recordation of the error. In contrast, continuing to operate the system with corrupted data may lead to an uncontrolled shutdown, or “crash.” In either case, uncorrectable errors result in shutdowns that reduce the uptime and availability of the system.
To improve the reliability and availability of such a system, what is needed is a solution that can, upon identification of an uncorrectable error, switch to a redundant backup memory system so that the system may continue operating without downtime. In addition, what is needed is a solution that can test the memory device to determine whether the uncorrectable error is due to a physical problem with the memory device or to a random transient event.
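One way such a test could distinguish a physical problem from a transient event is to write known patterns to the failing location and read them back. The following is a minimal sketch of that idea, not a description of any particular solution. The `SimulatedMemory` class, the `classify_error` function, the byte-wide cells, and the four test patterns are all hypothetical, and the simulated fault is assumed to be permanently stuck, whereas a real “sticky bit” may fail only intermittently.

```python
class SimulatedMemory:
    """Hypothetical byte-wide memory model used only for this sketch.

    stuck_at maps an address to (mask, value): the bits selected by mask
    always hold the corresponding bits of value, whatever is written.
    """

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, value):
        mask, forced = self.stuck_at.get(addr, (0, 0))
        self.cells[addr] = (value & ~mask & 0xFF) | (forced & mask)

    def read(self, addr):
        return self.cells[addr]


# All zeros, all ones, and two alternating patterns, so that every bit
# is driven to both 0 and 1 at least once.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def classify_error(mem, addr):
    """Return 'physical' if any pattern fails to read back, else 'transient'.

    The test overwrites the location; its contents are assumed to be
    already lost because of the uncorrectable error.
    """
    for pattern in PATTERNS:
        mem.write(addr, pattern)
        if mem.read(addr) != pattern:
            return "physical"
    return "transient"
```

For example, with `mem = SimulatedMemory(4, stuck_at={2: (0x08, 0x08)})`, `classify_error(mem, 2)` reports a physical problem because bit 3 of that cell cannot be cleared, while `classify_error(mem, 1)` reports a transient event because every pattern reads back correctly.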