A variety of factors, including faulty components and inadequate design tolerances, may result in errors in the data being processed by a computer. These errors also commonly occur during data transmission due to "noise" in the communication channel. As a result of these errors, one or more bits to be transmitted within the system, each of which may be represented as X, are corrupted so as to be received as /X (i.e. the logical complement of the value of X). In order to protect a computer system against such errors, the data bits may be encoded with an error correcting code (ECC) in such a way that the errors may be detected and possibly corrected by special ECC logic circuits. A typical ECC implementation appends a number of check bits to each data word. The appended check bits are used by the ECC logic circuits to detect errors within the data word.
The simplest and most common form of error control is implemented through the use of the parity bit. The single parity bit is appended to the data word and assigned to be either a 0 or a 1, so as to make the number of 1's in the data word, including the parity bit, even in the case of even parity codes, or odd in the case of odd parity codes.
Prior to the transmission of the data word in a computer system, often upon the initial storage of the data word, the value of the parity bit is computed at the source point and appended to the data word. Upon receipt of the transmitted data word, logic at the destination point recalculates the parity bit and compares it to the received, previously appended parity bit. If the recalculated and received parity bits are not equal, an error has been detected. In the typical case, this means that a single bit in the data word has transitioned from its original value, for example 1 to 0 or 0 to 1, although any odd number of bit errors produces the same mismatch. If the received and recalculated parity bits are equal, then it can be concluded that such a single bit error did not occur; however, multiple bit errors may not be ruled out. For example, if a data bit changes from a 0 to a 1 and another data bit changes from a 1 to a 0 (i.e. a double bit error), the parity of the data word will not change and the error will be undetected. Thus, use of the parity bit provides single error detection; however, it fails to detect any error involving an even number of bits, and it provides no information on the location of the erroneous bit(s).
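The parity scheme described above can be illustrated with a short sketch. The function names and the example data word are hypothetical and chosen only for illustration; the sketch assumes even parity by default and shows a single bit error being detected while a double bit error passes unnoticed.

```python
def parity_bit(data_bits, even=True):
    """Return the parity bit that makes the total count of 1s even (or odd)."""
    ones = sum(data_bits) % 2
    return ones if even else ones ^ 1


def parity_ok(data_bits, received_parity, even=True):
    """Recalculate parity at the destination and compare with the received bit."""
    return parity_bit(data_bits, even) == received_parity


# Source side: compute and append the parity bit (hypothetical 8-bit word).
word = [1, 0, 1, 1, 0, 0, 1, 0]
p = parity_bit(word)

# Single bit error: one bit flips, so the recalculated parity differs.
single = word.copy()
single[2] ^= 1

# Double bit error: a 1 becomes 0 and a 0 becomes 1, so parity is unchanged.
double = word.copy()
double[2] ^= 1
double[5] ^= 1

print(parity_ok(word, p), parity_ok(single, p), parity_ok(double, p))
```

The third check succeeds even though two bits are wrong, which is the limitation that motivates the additional check bits discussed next.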
By appending additional parity bits to the data word, each corresponding to a subset of data bits within the data word, the parity bit concept may be easily expanded to provide the detection of multiple bit errors or to determine the location of single or multiple bit errors. Once a data bit error is located it is a simple matter to cause a logic circuit to correct the located erroneous bit, thereby providing single error correction (SEC). Many single error correction codes have the ability to detect double errors and are thus termed single error correcting double error detecting codes (SEC-DED).
Multiple error detection schemes rely on appending additional check bits to the data word. The best known SEC-DED ECC is the so-called Hamming code (in its extended form, which includes an overall parity bit), which appends a series of check bits to the data word as it is stored in memory. Upon a read operation, the retrieved check bits are compared against recalculated check bits to detect, locate and correct a single bit error. By adding more check bits and appropriately overlapping the subsets of data bits represented thereby, other error correcting codes have been devised for providing three bit error detection and two bit error correction, and, via the further addition of check bits, codes can be formulated to detect and correct any number of data bit errors.
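As a minimal sketch of how overlapping parity subsets locate an error, the following uses a textbook extended Hamming code over a 4-bit data word (8 bits in total). The word size, bit layout and function names are assumptions made for illustration; memory systems of the kind discussed here use wider words and hardware logic, not software.

```python
def encode(d):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit; indices 1..7 follow the classic
    Hamming layout with check bits at positions 1, 2 and 4.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]   # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]   # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]   # covers positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2       # overall parity enables double error detection
    return c


def decode(c):
    """Return (data_bits, status); status is 'ok', 'corrected' or 'double'."""
    c = list(c)
    # The syndrome is the XOR of the positions of all set bits; for a single
    # error in positions 1..7 it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2        # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error and correct it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Overall parity holds but the syndrome is nonzero: two bits flipped.
        return None, "double"
    return [c[3], c[5], c[6], c[7]], status


codeword = encode([1, 0, 1, 1])
damaged = codeword.copy()
damaged[6] ^= 1                 # inject a single bit error
print(decode(damaged))
```

Each check bit covers a different subset of positions, so the pattern of failing checks identifies the erroneous position, while the overall parity bit distinguishes a correctable single error from an uncorrectable double error.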
The use of such robust forms of ECC has long been recognized as a necessity for the main storage on large computer systems such as the S/390 Parallel Enterprise Server--Generation 3 and the S/390 Parallel Enterprise Server--Generation 4 computer systems available from IBM Corporation (S/390 and IBM are registered trademarks, and S/390 Parallel Enterprise Server is a trademark of IBM Corporation). Since the main storage on such large systems often serves as the central data repository accessed by disparate users throughout an enterprise, the criticality of preserving the integrity of the massive amount of data stored on such large systems is readily apparent. Accordingly, large system customers have long demanded that their systems incorporate a form of multiple error detecting and correcting ECC.
With the advent of the network centric model for computer systems and with the increased power available in relatively small computer systems, the role of the server has increasingly become a shared role, with the traditional high-end mainframe computers operating at one extreme, and small PC-based servers operating at the other extreme. Until recently, small PC-based servers which serve either a departmental, office or workgroup network did not include even rudimentary SEC ECC. Many commercially available PCs still implement parity bit error control schemes, which, as previously shown, do not provide adequate protection against double bit errors and do not offer any error correction facilities.
Cognizant of the newly created need to provide a more robust ECC to these small scale servers, companies have begun to offer retrofit mechanisms such as ECC-on-SIMM (single in-line memory module) or "EOS" (available from IBM Corporation), which transparently implements a compatible, self-contained, on-SIMM, SEC ECC in an existing parity PC system. The underlying concepts of the EOS product are embodied in U.S. Pat. No. 5,623,506, issued to Dell et al., U.S. Pat. No. 5,465,262, issued to Dell et al., U.S. Pat. No. 5,450,422, issued to Dell, and U.S. Pat. No. 5,379,304, issued to Dell et al. Each of the foregoing patents is assigned to IBM Corporation, the present assignee hereof, and is incorporated herein by reference. With the availability of such products, the server owner may upgrade his/her server to include SEC ECC without having to change the planar/motherboard, memory controller, or operating system software. Additionally, memory controller chip sets which support SEC ECC are becoming increasingly commercially available. Moreover, microprocessor manufacturers are now beginning to offer SEC ECC support in their products such as the Intel Pentium Pro Microprocessor (Intel and Pentium are registered trademarks of Intel Corporation).
While these SEC ECC retrofit products offer increased protection for the PC-based servers, their ECC is limited and will not, for example, correct multiple data bit errors such as would be experienced upon the failure of an entire dynamic random access memory (DRAM) chip, without the addition of special high-end architectural techniques which would prove prohibitively costly for the consumer of PC-servers.
Accordingly, there exists a need for a simple, transparent mechanism by which a user may retrofit a more robust ECC to an existing SEC ECC or parity based computer system. In order for such a solution to prove effective, the mechanism should be cost efficient, and totally compatible with the existing computer system. The retrofit mechanism must enable the correction of an entire DRAM chip failure and preferably would be compatible with commercially available DRAM chips whether organized with four data bits per chip or with eight data bits per chip. Finally, the retrofit mechanism should be provided in an efficient and practical manner that will facilitate easy implementation in a commercially available application specific integrated circuit (ASIC).
With such a solution, a server owner may easily upgrade the ECC for his/her server without undergoing the labor and expense of modifying the processor or controller hardware or changing the operating software therefor. As such, the level of data integrity in the server may be easily scaled in accordance with the storage and access requirements thereof.
The advantages provided by the aforementioned retrofit ECC schemes precipitate a new set of issues to be resolved by the server owner. In particular, the ability to transparently enhance error correction for the computer system via the addition of a new error correction apparatus may prevent the computer system from properly tracking the frequency of errors being transparently corrected via this new device.
For example, in a computer system implementing parity error control and having an IBM EOS upgrade to enable SEC ECC therein, the EOS SIMM will correct all single bit errors without notifying the original error control logic of the computer system of the occurrence of these errors. Without such notification, the computer system cannot utilize its existing error control logic to determine, based upon the errors from the SIMMs, whether it is necessary to initiate a maintenance notification so as to replace a SIMM that has been accumulating errors. Failure to notify the system of such accumulating errors for a SIMM could lead to a condition wherein more than one bad bit is aligned in a single ECC word, which would constitute an uncorrectable error for such a system.
Likewise, as the retrofit ECC enhancing mechanism provides greater and greater error correcting capacity, as in the aforementioned inventive DRAM chip correction retrofit apparatus, the ability of the original system to properly recognize and respond to accumulating SIMM errors is correspondingly diminished with the concomitant consequence of failing to permit the detection and notification of a maintenance requirement.
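The kind of error accumulation tracking that the original error control logic loses when corrections are hidden from it can be sketched as follows. This is a purely illustrative model: the class name, the per-SIMM identifiers and the threshold value are assumptions and are not taken from any actual system.

```python
from collections import Counter


class CorrectedErrorLog:
    """Count corrected errors per SIMM and flag when maintenance is due."""

    def __init__(self, threshold=3):
        # Hypothetical policy: request maintenance once a SIMM has reported
        # this many corrected errors.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, simm_id):
        """Record one corrected error; return True when the threshold is met."""
        self.counts[simm_id] += 1
        return self.counts[simm_id] >= self.threshold


log = CorrectedErrorLog(threshold=3)
for _ in range(3):
    maintenance_due = log.record("SIMM0")
print(maintenance_due)
```

Such a policy can only work if every corrected error reaches `record`; a retrofit device that corrects errors silently leaves the counts at zero regardless of how many errors the SIMM has accumulated.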
In certain systems implementing the IBM EOS ECC enhancement, the problem has been addressed by installing special hardware, which, when added to the memory subsystem, permits the error control logic to sense error lines of the EOS SIMM. A special version of the EOS SIMM has been devised to bring error lines to the SIMM tabs, and to activate them upon the correction of an error. This solution, however, requires a hardware change to the existing computer system and as such is inconsistent with the objectives of the EOS device in that with such hardware modifications, the EOS device ceases to transparently implement the ECC upgrade.
Accordingly, there exists a further need to provide a transparent mechanism for notifying the original error control logic in a computer system having an ECC enhanced retrofit device that the enhanced ECC device has corrected an error. With such a solution, the original computer system error control logic will retain the ability to determine the accumulation of errors and implement a preventative maintenance strategy accordingly. Consequently, error correction and maintenance operations are transparently enhanced within the original computer system by virtue of the inventive apparatus.