Ensuring the integrity of data processed by a data processing system such as a computer or like electronic device is critical for the reliable operation of such a system. Data integrity is of particular concern, for example, in fault tolerant applications such as servers, databases, scientific computers, and the like, where any errors whatsoever could jeopardize the accuracy of complex operations and/or cause system crashes that affect large numbers of users. In many such applications, system availability is of paramount importance, and as such, the ability to detect and/or correct potential failures in a system is a highly desirable feature.
Data integrity issues are a concern, for example, for many solid state memory arrays such as those used as the main working storage repository for a data processing system. Solid state memory arrays are typically implemented using multiple integrated circuit memory devices or chips such as static or dynamic random access memory (SRAM or DRAM) devices, and are controlled via memory controllers typically disposed on separate integrated circuit devices and coupled thereto via a memory bus. Solid state memory arrays may also be used in embedded applications, e.g., as cache memories or buffers on logic circuitry such as a processor chip.
It has been found, for example, that solid state memory devices are often susceptible to adverse temperature effects, leading to issues such as lost bits, poor timing characteristics, increased noise, and decreased performance. While some of these issues may result in the generation of errors that are potentially correctable without any loss of data, in some instances temperature effects or other adverse conditions may lead to higher error rates, and thus a greater risk of encountering a non-recoverable error. Furthermore, with the increased power levels seen in higher performance systems, as well as the typical focus on controlling the operating temperature of the processor chips in a system, it has been found that many memory devices are required to operate at or above recommended temperature limits. Moreover, in many instances these effects may be dependent upon the physical location of a device and the airflow characteristics of the system enclosure, and the temperature effects experienced by one device may differ from those experienced by other devices in the same system.
A significant amount of effort has been directed toward detecting and correcting errors in memory devices during power up of a data processing system, as well as during the normal operation of such a system. It is desirable, for example, to enable a data processing system to, whenever possible, detect and correct any errors automatically, without requiring a system administrator or other user to manually perform any repairs. It is also desirable for any such corrections to be performed in such a fashion that the system remains up and running. Often such characteristics are expensive and only available on complex, high performance data processing systems. Furthermore, many types of errors leave a conventional system unable to do anything other than “crash,” requiring a physical repair before normal device operation can be restored.
Conventional error detection and correction mechanisms for solid state memory devices typically rely on parity bits or checksums to detect inconsistencies in data as it is retrieved from memory. Furthermore, through the use of Error Correcting Codes (ECCs) or other correction algorithms, it is possible to correct some errors, e.g., errors ranging from single-bit errors up to single-device errors, and recreate the proper data. Another capability supported in some systems is referred to as “memory scrubbing,” where a background process periodically reads each location in a memory array and utilizes ECC circuitry to detect and (if possible) correct any errors in the array. In large memory systems, however, background scrubbing can take hours or days to make a complete pass through the memory space; alternatively, faster scrubbing may be used, albeit with reduced system performance due to the need to allocate a portion of the available memory bandwidth to scrub operations.
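By way of illustration only, the following minimal sketch models single-bit error correction and a scrub pass in software. It assumes a Hamming(7,4) code, which is chosen here purely for brevity and is not stated above to be the code used by any particular memory system; the function names (`encode`, `syndrome`, `correct`, `decode`, `scrub`) are hypothetical. A plain Hamming(7,4) code cannot distinguish a double-bit error from a single-bit error, so practical ECC circuitry would typically use a stronger code.

```python
# Illustrative sketch only: Hamming(7,4) single-bit correction plus a scrub pass.
# All names are hypothetical; a codeword is a list of 7 bits for positions 1..7.

def encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]

def syndrome(word):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def correct(word):
    """Return (corrected_word, error_position); position 0 means no error seen."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1                # flip the bit the syndrome points at
    return fixed, s

def decode(word):
    """Extract the 4 data bits from positions 3, 5, 6 and 7."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)

def scrub(memory):
    """Read every location once and rewrite any word with a correctable error."""
    repaired = []
    for addr, word in enumerate(memory):
        fixed, pos = correct(word)
        if pos:
            memory[addr] = fixed
            repaired.append(addr)
    return repaired

# Usage: store three words, flip one bit in the second, then scrub.
memory = [encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1
repaired = scrub(memory)                 # repaired == [1]
value = decode(memory[1])                # value == 0xA
```

Because `scrub` touches every location on each pass, the time for a complete pass grows with the size of the array, which is the source of the trade-off between scrub rate and memory bandwidth noted above.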
In addition, some conventional correction mechanisms for solid state arrays may be capable of disabling defective devices or utilizing redundant capacity within a memory system to isolate errors and permit continued operation of a data processing system. For example, steering may be used to effectively swap out a defective memory device with a spare memory device. Notably, however, redundant devices are not used unless a failure is detected in another device. Furthermore, even when a redundant device is used, the use may only be temporary, until the failed device (or card upon which the device is mounted) is replaced. As a result, redundant devices add to the overall cost of a system, while remaining idle the vast majority of the time.
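The steering concept described above can be sketched, under simplifying assumptions, as a routing table that maps each logical device slot to a physical device. The class `SteeredRank` and its methods are hypothetical names introduced for illustration; the sketch assumes a single spare per group of devices and assumes that the contents of the failing device can still be read (in practice, corrected data would typically be recovered through ECC before being copied).

```python
# Hypothetical software model of device steering; names are illustrative only.

class SteeredRank:
    def __init__(self, num_devices):
        # One extra physical device is held in reserve as the spare.
        self.devices = [dict() for _ in range(num_devices + 1)]
        self.spare = num_devices
        self.route = list(range(num_devices))   # logical slot -> physical device
        self.spare_in_use = False

    def write(self, slot, addr, value):
        self.devices[self.route[slot]][addr] = value

    def read(self, slot, addr):
        return self.devices[self.route[slot]].get(addr)

    def steer(self, slot):
        """Copy the slot's contents to the spare, then reroute the slot to it."""
        if self.spare_in_use:
            raise RuntimeError("spare device already in use")
        failing = self.route[slot]
        self.devices[self.spare] = dict(self.devices[failing])
        self.route[slot] = self.spare
        self.spare_in_use = True

# Usage: slot 2 is judged defective and is steered to the spare device.
rank = SteeredRank(4)
rank.write(2, 0x10, 0xAB)
rank.steer(2)
value = rank.read(2, 0x10)               # still 0xAB, now served by the spare
```

Until `steer` is called, the spare entry in `devices` holds no data and serves no accesses, which reflects the idle redundant capacity noted above.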
While existing error detection and correction mechanisms provide increased reliability, a need still exists for enhancing the ability of a data processing system to monitor the health of its memory system, and correct any errors or potential errors with little or no impact on system performance and availability. Furthermore, a significant need also exists for expanding the monitoring capability of a data processing system in a manner that is cost efficient, that requires minimal modification, and that has minimal impact on system performance.