Ensuring the integrity of data processed by a data processing system such as a computer or like electronic device is critical for the reliable operation of such a system. Data integrity is of particular concern, for example, in fault tolerant applications such as servers, databases, scientific computers, and the like, where any errors whatsoever could jeopardize the accuracy of complex operations and/or cause system crashes that affect large numbers of users.
Data integrity issues are a concern, for example, for many solid state memory arrays such as those used as the main working storage repository for a data processing system. Solid state memory arrays are typically implemented using multiple integrated circuit memory devices such as static or dynamic random access memory (SRAM or DRAM) devices, and are controlled via memory controllers typically disposed on separate integrated circuit devices and coupled thereto via a memory bus. Solid state memory arrays may also be used in embedded applications, e.g., as cache memories or buffers on logic circuitry such as a processor chip.
A significant amount of effort has been directed toward detecting and correcting errors in memory devices during power up of a data processing system, as well as during the normal operation of such a system. It is desirable, for example, to enable a data processing system to, whenever possible, detect and correct any errors automatically, without requiring a system administrator or other user to manually perform any repairs. It is also desirable for any such corrections to be performed in such a fashion that the system remains up and running. Often such capabilities are expensive and only available on complex, high performance data processing systems. Furthermore, many types of errors leave a conventional system unable to do anything other than “crash”, and require a physical repair before normal device operation can be restored.
Conventional error detection and correction mechanisms for solid state memory devices typically rely on parity bits or checksums to detect inconsistencies in data as it is retrieved from memory. Furthermore, through the use of Error Correcting Codes (ECCs) or other correction algorithms, it is possible to correct some errors, e.g., errors ranging from single-bit errors up to single-device errors, and recreate the proper data.
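By way of illustration only, the distinction between detection and correction may be sketched with a single even-parity bit and a textbook Hamming(7,4) code. The sketch is not drawn from any particular memory system; the function names (parity_bit, hamming74_encode, hamming74_decode) and the bit ordering are assumptions made for the example, and practical memory ECC typically uses wider codes.

```python
def parity_bit(word):
    """Return the even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Bit i of the result holds codeword position i + 1. Positions 1, 2 and 4
    carry parity; positions 3, 5, 6 and 7 carry data bits d0..d3.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(codeword):
    """Return (nibble, syndrome), correcting at most one flipped bit.

    The syndrome is the XOR of the positions of all set bits. It is 0 for a
    valid codeword and equals the position of the error after a single flip.
    """
    bits = [(codeword >> i) & 1 for i in range(7)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the faulty bit back
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome
```

A parity bit alone reveals that an odd number of bits changed but not which ones, whereas the nonzero syndrome returned by the hypothetical hamming74_decode identifies the position of a single flipped bit so that the original data can be recreated. A code of this size cannot reliably handle two simultaneous errors in one codeword.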
In addition, some conventional correction mechanisms for solid state arrays may be capable of disabling defective devices or utilizing redundant capacity within a memory system to isolate errors and permit continued operation of a data processing system. For example, steering may be used to effectively swap out a defective memory device with a spare memory device. One drawback associated with using redundant capacity, however, is the need for redundant devices to be installed in an operating environment, which can add cost and complexity to a system for components that may never be used.
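As a rough sketch of the steering concept described above, the mapping from logical device slots to physical devices may be modeled as a table in which a failed device's entry is redirected to a spare. The class name SteeredRank and its methods are hypothetical, and the sketch omits the copying or rebuilding of data onto the spare that an actual system would need to perform.

```python
class SteeredRank:
    """Hypothetical model of a rank of memory devices with spare capacity."""

    def __init__(self, active, spares):
        # Logical slot i initially maps to physical device i.
        self.slot_to_device = list(range(active))
        # Spare physical devices are installed but initially unused.
        self.spares = list(range(active, active + spares))

    def steer(self, slot):
        """Redirect a logical slot to a spare; return the retired device."""
        if not self.spares:
            raise RuntimeError("no spare device available")
        failed = self.slot_to_device[slot]
        self.slot_to_device[slot] = self.spares.pop(0)
        return failed
```

The sketch also makes the drawback noted above concrete: the spare devices must be present from the outset, and they remain idle unless a failure occurs.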
One area where it would be particularly desirable to provide improved error detection and correction relates to failed data lines or interfaces used with a memory device. Data lines between devices can fail due to shorts, opens, increased resistance, or various forms of coupled noise, and such failures can often lead to system failures. In some instances, a failure occurs only when more than one of these factors is present.
Whenever a data line fails, e.g., within a memory device, within a memory controller and/or within a signal path therebetween, often the data accessed via the data line, e.g., the data stored in a memory array, may still be valid and uncorrupted. However, with a failure in a data line coupled to a memory array, the data in the memory array typically becomes inaccessible externally from the memory device.
In addition, in some memory systems, individual memory devices are provided with multiple memory arrays, with separate data lines dedicated to each array on the device. For example, a synchronous DRAM (SDRAM) with four memory arrays may be designated as an x4 device, with one data line dedicated to each array, resulting in a total of four data lines. With the failure of only one data line, however, an entire memory device typically becomes compromised, even if the other data lines continue to operate normally. ECC is often available to detect and correct errors in such a circumstance. However, whenever a data line fails, a risk exists that another error may arise in another area of the system and expose the memory device to unrecoverable errors that may lead to data corruption and/or system failure.
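The effect of a single failed data line on an x4 device may be illustrated with a simplified model. The class X4Device, its stuck attribute, and the stuck-at fault behavior are assumptions made for the example. Only the read path is modeled, and the fault is assumed to arise after the data has been written.

```python
class X4Device:
    """Hypothetical x4 device: four arrays, one data line per array."""

    def __init__(self, depth):
        # arrays[line][addr] holds the bit delivered over that data line.
        self.arrays = [[0] * depth for _ in range(4)]
        # Maps a failed data line index to the value it is stuck at.
        self.stuck = {}

    def write(self, addr, nibble):
        """Store one bit of the nibble in each of the four arrays."""
        for line in range(4):
            self.arrays[line][addr] = (nibble >> line) & 1

    def read(self, addr):
        """Assemble a nibble from the four data lines, honoring stuck lines."""
        value = 0
        for line in range(4):
            bit = self.stuck.get(line, self.arrays[line][addr])
            value |= bit << line
        return value


dev = X4Device(depth=8)
dev.write(3, 0b1111)
dev.stuck[2] = 0         # data line 2 fails, stuck at 0
observed = dev.read(3)   # the value seen externally no longer matches
stored = dev.arrays[2][3]  # the bit inside the array itself is unchanged
```

In this model the stored bit remains valid after the failure, yet every read from the device returns a nibble with one unreliable bit, so any error correction applied across devices must spend part of its capability on the failed line.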
Therefore, a significant need continues to exist in the art for a manner of addressing failures in data interfaces used with memory devices and other logic circuits.