A variety of factors, including faulty components and inadequate design tolerances, may result in errors in the data being processed by a computer. Such errors also commonly occur during data transmission due to "noise" in the communication channel. As a result of these errors, one or more bits to be transmitted within the system, each of which may be represented as X, are corrupted so as to be received as /X (i.e. the logical complement of the value of X). In order to protect a computer system against such errors, the data bits may be encoded with an error correcting code (ECC) in such a way that the errors may be detected and possibly corrected by special ECC logic circuits.
A typical ECC implementation appends a number of check bits to each data word. The appended check bits are used by the ECC logic circuits to detect errors within the data word.
The simplest and most common form of error control is implemented through the use of the parity bit. A single parity bit is appended to the data word and assigned to be either a 0 or a 1, so as to make the total number of 1's in the data word together with the parity bit even in the case of even parity codes, or odd in the case of odd parity codes.
Prior to the transmission of the data word in a computer system, often upon the initial storage of the data word, the value of the parity bit is computed at the source point and appended to the data word. Upon receipt of the transmitted data word, logic at the destination point recalculates the parity bit and compares it to the received, previously appended parity bit. If the recalculated and received parity bits are not equal, an error has been detected. In the typical case this means that a single bit in the data word has transitioned from its original value, for example 1 to 0 or 0 to 1, although any odd number of bit errors produces the same mismatch. If the received and recalculated parity bits are equal, then it can be concluded that such a single bit error did not occur; however, multiple bit errors may not be ruled out. For example, if a data bit changes from a 0 to a 1 and another data bit changes from a 1 to a 0 (i.e. a double bit error), the parity of the data word will not change and the error will be undetected. Thus, use of the parity bit provides single error detection, but it fails to detect any error that affects an even number of bits, and it fails to provide information on the location of the erroneous bit(s).
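The parity check described above can be illustrated with a minimal Python sketch. The function names, the 8-bit example word, and the list-of-bits representation are assumptions chosen for illustration only; they are not taken from any particular system.

```python
def parity_bit(data_bits, even=True):
    """Return the parity bit to append to a list of 0/1 data bits."""
    ones_are_odd = sum(data_bits) % 2
    # Even parity: the appended bit makes the total count of 1's even.
    # Odd parity: the appended bit makes the total count of 1's odd.
    return ones_are_odd if even else ones_are_odd ^ 1


def parity_matches(received_bits, received_parity, even=True):
    """Recalculate parity at the destination and compare with the received bit."""
    return parity_bit(received_bits, even) == received_parity


# Hypothetical 8-bit data word containing four 1's.
word = [1, 0, 1, 1, 0, 0, 1, 0]
p = parity_bit(word)

# Single bit error: bit 2 flips from 1 to 0.
single = list(word)
single[2] ^= 1

# Double bit error: bit 2 flips from 1 to 0 and bit 5 flips from 0 to 1.
double = list(word)
double[2] ^= 1
double[5] ^= 1

print(parity_matches(word, p))    # True: no error
print(parity_matches(single, p))  # False: single bit error detected
print(parity_matches(double, p))  # True: double bit error goes undetected
```

The last line shows the limitation noted above: two offsetting bit flips leave the parity unchanged, so the check passes even though the data word is corrupted.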
By appending additional parity bits to the data word, each corresponding to a subset of data bits within the data word, the parity bit concept may be easily expanded to provide the detection of multiple bit errors or to determine the location of single or multiple bit errors. Once a data bit error is located it is a simple matter to cause a logic circuit to correct the located erroneous bit, thereby providing single error correction (SEC). Many single error correction codes have the ability to detect double errors and are thus termed single error correcting double error detecting codes (SEC-DED).
Multiple error detection schemes rely on appending additional check bits to the data word. The best-known SEC-DED ECC is the so-called Hamming code, which appends a series of check bits to the data word as it is stored in memory. Upon a read operation, the retrieved check bits are compared against recalculated check bits to detect and locate, and thereby correct, a single bit error. By adding more check bits and appropriately overlapping the subsets of data bits represented thereby, other error correcting codes have been devised for providing three bit error detection and two bit error correction, and, via the further addition of check bits, codes can be formulated to detect and correct any number of data bit errors.
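As a concrete illustration of overlapping parity subsets, the following Python sketch implements a small Hamming code with one added overall parity bit, which gives SEC-DED behavior on a 4-bit data word. The 4-bit word size, the bit layout, and the names `encode` and `decode` are assumptions made for this example; memory systems of the kind discussed here use much wider words and different check-bit assignments.

```python
def encode(data):
    """Encode 4 data bits (list of 0/1) into an 8-bit SEC-DED codeword."""
    c = [0] * 8                      # c[1..7]: Hamming(7,4); c[0]: overall parity
    c[3], c[5], c[6], c[7] = data    # data bits occupy the non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]        # check bit over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]        # check bit over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]        # check bit over positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2            # overall parity over positions 1..7
    return c


def decode(codeword):
    """Return (status, data); status is 'ok', 'corrected', or 'double'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall_mismatch = sum(c) % 2    # 1 when an odd number of bits flipped

    if syndrome == 0 and overall_mismatch == 0:
        status = "ok"
    elif overall_mismatch == 1:
        # Treated as a single bit error: the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return "double", None
    return status, [c[3], c[5], c[6], c[7]]


data = [1, 0, 1, 1]
stored = encode(data)

one_error = list(stored)
one_error[6] ^= 1                    # flip one stored bit

two_errors = list(stored)
two_errors[2] ^= 1                   # flip two stored bits
two_errors[6] ^= 1

print(decode(stored))
print(decode(one_error))
print(decode(two_errors))
```

Each check bit covers a different subset of positions, so the pattern of failing checks (the syndrome) identifies the position of a single flipped bit. The extra overall parity bit distinguishes a correctable single error from an uncorrectable double error.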
The use of such robust forms of ECC has long been recognized as a necessity for the main storage on large computer systems such as the S/390 Parallel Enterprise Server--Generation 3 and the S/390 Parallel Enterprise Server--Generation 4 computer systems available from IBM Corporation (S/390 and IBM are registered trademarks, and S/390 Parallel Enterprise Server is a trademark of IBM Corporation). Since the main storage on such large systems often serves as the central data repository accessed by disparate users throughout an enterprise, the criticality of preserving the integrity of the massive amount of data stored on such large systems is readily apparent. Accordingly, large system customers have long demanded that their systems incorporate a form of multiple error detecting and correcting ECC.
With the advent of the network centric model for computer systems and with the increased power available in relatively small computer systems, the role of the server has increasingly become a shared role, with the traditional high-end mainframe computers operating at one extreme, and small PC-based servers operating at the other extreme. Until recently, small PC-based servers which serve a departmental, office or workgroup network did not include even rudimentary SEC ECC. Many commercially available PCs still implement parity bit error control schemes which, as previously shown, do not detect double bit errors and do not offer any error correction facilities.
Cognizant of the newly created need to provide a more robust ECC to these small scale servers, companies have begun to offer retrofit mechanisms such as ECC-on-SIMM (single in-line memory module) or EOS (available from IBM Corporation), which transparently implements a compatible, self-contained, on-SIMM, SEC ECC in an existing parity PC system. The underlying concepts of EOS are embodied in U.S. Pat. No. 5,623,506, issued to Dell et al., U.S. Pat. No. 5,465,262, issued to Dell et al., U.S. Pat. No. 5,450,422, issued to Dell, and U.S. Pat. No. 5,379,304, issued to Dell et al. Each of the foregoing patents is assigned to IBM Corporation, the present assignee hereof, and each of the patents is incorporated herein by reference.
With the availability of such products, the server owner may upgrade his/her server to include SEC ECC without having to change the planar/motherboard, memory controller, or operating system software. Additionally, memory controller chip sets which support SEC ECC are becoming increasingly commercially available. Moreover, microprocessor manufacturers are now beginning to offer SEC ECC support in their products such as the Intel Pentium Pro Microprocessor (Intel and Pentium are registered trademarks of Intel Corporation).
While these SEC ECC retrofit products offer increased protection for the PC-based servers, their ECC is limited and will not, for example, correct multiple data bit errors such as would be experienced upon the failure of an entire dynamic random access memory (DRAM) chip, without the addition of special high-end architectural techniques which would prove prohibitively costly for the consumer of PC-servers.
Accordingly, there exists a need for a simple, transparent mechanism by which a user may retrofit a more robust ECC to an existing SEC ECC or parity based computer system. In order for such a solution to prove effective, the mechanism should be cost efficient, and totally compatible with the existing computer system. The retrofit mechanism must enable the correction of an entire DRAM chip failure and preferably would be compatible with commercially available DRAM chips whether organized with four data bits per chip or with eight data bits per chip. Finally, the retrofit mechanism should be provided in an efficient and practical manner that will facilitate easy implementation in a commercially available application specific integrated circuit (ASIC).
With such a solution, a server owner may easily upgrade the ECC for his/her server without undergoing the labor and expense of modifying the processor or controller hardware or changing the operating software therefor. As such, the level of data integrity in the server may be easily scaled in accordance with the storage and access requirements thereof.