This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:
"Single Symbol Correction Double Symbol Detection Code Employing A Modular H-Matrix," Chen et al., Ser. No. 09/451,133;
"Detecting Address Faults In An ECC-Protected Memory," Chen et al., Ser. No. 09/451,261; and,
"Method, System And Program Products For Error Correction Code Conversion," Chen et al., Ser. No. 09/450,548.
This invention relates, in general, to computer error correction codes, and in particular, to generating a special error correction code for failure isolation.
The small size of computer transistors and capacitors, combined with transient electrical and electromagnetic phenomena, causes occasional errors in information stored in computer memory systems. Therefore, even well-designed and generally reliable memory systems are susceptible to memory device failures.
In an effort to minimize the effects of these memory device failures, various error checking schemes have been developed to detect, and in some cases correct, errors in messages read from memory. The simplest error detection scheme is the parity bit. A parity bit is an extra bit included with a binary data message or data word to make the total number of 1's in the message either odd or even. For "even parity" systems, the parity bit is set to make the total number of 1's in the message even. For "odd parity" systems, the parity bit is set to make the total number of 1's in the message odd. For example, in a system utilizing odd parity, a message having two 1's would have its parity bit set to 1, thereby making the total number of 1's odd. Then, the message including the parity bit is transmitted and subsequently checked at the receiving end for errors. An error results if the parity of the data bits in the message does not correspond to the parity bit transmitted. As a result, single bit errors can be detected. However, since there is no way to detect which particular bit is in error, correction is not possible. Furthermore, if two or any even number of bits are in error, the parity will be correct and no error will be detected. Parity therefore is capable of detecting only odd numbers of errors and is not capable of correcting any bits determined to be in error.
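The parity scheme just described can be illustrated with a minimal sketch. The function names `parity_bit` and `parity_ok` are chosen here for illustration only and do not appear in the application.

```python
def parity_bit(data_bits, odd=False):
    """Return the parity bit for a list of 0/1 data bits."""
    ones = sum(data_bits) % 2
    # Even parity: the bit makes the total count of 1's even.
    # Odd parity: the bit is the complement of the even-parity bit.
    return ones ^ 1 if odd else ones


def parity_ok(message, odd=False):
    """message = data bits followed by the parity bit."""
    return sum(message) % 2 == (1 if odd else 0)


data = [1, 0, 1, 0]                 # a message having two 1's
p = parity_bit(data, odd=True)      # 1, making the total number of 1's odd
word = data + [p]

single_error = list(word)
single_error[0] ^= 1                # one flipped bit: detected

double_error = list(word)
double_error[0] ^= 1
double_error[1] ^= 1                # two flipped bits: parity still looks correct
```

Here `parity_ok(single_error, odd=True)` is false while `parity_ok(double_error, odd=True)` is true, which is the limitation noted above: an even number of bit errors goes undetected, and a detected error cannot be located.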
Error correction codes (ECCs) have thus been developed to not only detect but also correct bits determined to be in error. ECCs utilize multiple parity check bits stored with the data message in memory. Each check bit is a parity bit for a group of bits in the data message. When the message is read from memory, the parity of each group, including the check bit, is evaluated. If the parity is correct for all of the groups, it signifies that no detectable error has occurred. If one or more of the newly generated parity values are incorrect, a unique pattern called a syndrome results which may be used to identify the bit in error. Upon detection of the particular bit in error, the error may be corrected by complementing the erroneous bit.
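As a minimal sketch of how a syndrome identifies the bit in error, the following uses the textbook Hamming(7,4) layout, in which check bits occupy positions 1, 2 and 4 and the syndrome, read as a number, is the position of the flipped bit. This layout and the function names are assumptions made for illustration; the application does not prescribe a particular code here.

```python
def encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    c = [0] * 8                      # index 0 unused so indices match positions
    c[3], c[5], c[6], c[7] = d       # data bits
    c[1] = c[3] ^ c[5] ^ c[7]        # check bit for positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # check bit for positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # check bit for positions with bit 2 set
    return c[1:]


def syndrome(word):
    """XOR of the positions of all 1 bits; 0 means every parity group checks."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def correct(word):
    """Complement the bit named by the syndrome, if any."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


codeword = encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[4] ^= 1                      # flip the bit at position 5
repaired, position = correct(damaged)
```

After this runs, `position` is 5 and `repaired` equals `codeword`. A code of this kind corrects only single bit errors; two flipped bits yield a nonzero syndrome that points at the wrong position.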
A widely used type of ECC utilized in error control in digital systems is based on the codes devised by R. W. Hamming, and thus takes the name "Hamming codes". One particular subclass of Hamming codes includes the single error correcting and double error detecting (SEC-DED) codes. As their name suggests, these codes may be utilized not only to correct any single bit error but also to detect double bit errors.
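One common way to obtain SEC-DED behavior, sketched below, is to extend a Hamming(7,4) code with an overall parity bit. This construction and the names `encode_secded` and `decode_secded` are illustrative assumptions, not the code used in the application.

```python
def hamming_syndrome(bits7):
    """XOR of the positions (1..7) of all 1 bits."""
    s = 0
    for pos, bit in enumerate(bits7, start=1):
        if bit:
            s ^= pos
    return s


def encode_secded(d):
    """4 data bits -> 8-bit word: Hamming(7,4) plus an overall even parity bit."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    word = c[1:]
    return word + [sum(word) % 2]


def decode_secded(word):
    """Return (word, status) with status 'ok', 'corrected' or 'uncorrectable'."""
    s = hamming_syndrome(word[:7])
    overall = sum(word) % 2          # 1 means an odd number of bits flipped
    fixed = list(word)
    if s == 0 and overall == 0:
        return fixed, "ok"
    if overall == 1:
        # Treated as a single bit error: s names the position,
        # or s == 0 means the overall parity bit itself flipped.
        fixed[s - 1 if s else 7] ^= 1
        return fixed, "corrected"
    # Nonzero syndrome with even overall parity: double bit error.
    return fixed, "uncorrectable"


clean = encode_secded([1, 0, 1, 1])
one_bad = list(clean)
one_bad[2] ^= 1
two_bad = list(clean)
two_bad[2] ^= 1
two_bad[5] ^= 1
```

With these inputs, `decode_secded(one_bad)` returns the original word with status `"corrected"`, and `decode_secded(two_bad)` reports `"uncorrectable"`. The decoder learns that the word is bad in the second case, but nothing in the word records where the damage occurred.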
Another type of well-known ECC is the single symbol correction and double symbol detection (SSC-DSD) codes which are used to correct single symbol errors and detect double symbol errors. In systems implementing these types of codes, the symbol represents a multiple bit package or chip. Hence, as the name implies, an SSC-DSD code in a system utilizing n bit symbols would be capable of correcting n bits in a single symbol and detecting errors occurring in double symbols.
One limitation of typical ECCs is their inability to isolate uncorrectable errors to individual components of computing systems. For instance, although these ECCs are able to detect the occurrence of an uncorrectable error at a particular component of a computing system, no measure is taken to indicate the exact location of data corruption. Thus, after subsequent transmissions of the data, even though a user may be aware of the fact that the data had been corrupted, the user would not be aware of the location or component where the error actually occurred.
Accordingly, a need exists for an uncorrectable error isolation measure which is capable of isolating the occurrence of an uncorrectable error to a particular component of a computing system.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of an uncorrectable error isolation capability which isolates an uncorrectable error to one component of a plurality of components of a computing system. In one example, a method of isolating these uncorrectable errors includes: generating, upon detection of an uncorrectable error in a data word at the one component of the computing system, a check bit pattern to indicate occurrence of the uncorrectable error at the one component; and incorporating the check bit pattern into the data word.
In another example of the invention, a system for isolating an uncorrectable error to one component of a plurality of components of a computing system includes: means for generating, upon detection of an uncorrectable error in a data word at the one component of the computing system, a check bit pattern to indicate occurrence of the uncorrectable error at the one component; and means for incorporating the check bit pattern into the data word.
In still yet another example of the invention, an article of manufacture comprises a computer usable medium having computer readable program code means embodied therein for isolating an uncorrectable error to one component of a plurality of components of a computing system. The computer readable program code means in the article of manufacture includes: computer readable program code means for generating, upon detection of an uncorrectable error in a data word at the one component of the computing system, a check bit pattern to indicate occurrence of the uncorrectable error at the one component; and computer readable program code means for incorporating the check bit pattern into the data word.
Thus, described herein is a technique for isolating an uncorrectable error to one component of a plurality of components of a computing system. This technique first generates, upon the detection of an uncorrectable error in a data word at one component of the computing system, a check bit pattern to indicate the occurrence of an uncorrectable error. In addition, the check bit pattern is generated to correspond to a particular component of the computing system. Thus, each check bit pattern may be used to identify a particular component determined to be flawed or damaged. Subsequently, the check bit pattern is incorporated into the data word. In this manner, information regarding the occurrence of an uncorrectable error, as well as the location of the error, is transmitted with the data word. Thereafter, any appropriate recovery actions may be taken to address the uncorrectable error.
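The idea of generating a component-specific check bit pattern and incorporating it into the data word can be sketched as follows. Everything concrete in this sketch is a hypothetical choice made for illustration: the small (8,4) code, the component names, the `SIGNATURES` table, and the function names. It is not the code construction of the invention. The sketch assumes a code in which every single bit error produces an odd-weight syndrome, so even-weight nonzero syndromes are free to serve as per-component markers that a downstream checker also treats as uncorrectable.

```python
def check_bits(d):
    """Four check bits over four data bits (hypothetical toy code)."""
    d0, d1, d2, d3 = d
    return [d0 ^ d1 ^ d3, d0 ^ d2 ^ d3, d1 ^ d2 ^ d3, d0 ^ d1 ^ d2]


# Hypothetical per-component markers: even-weight, nonzero patterns that
# no single bit error can produce in this toy code.
SIGNATURES = {
    "memory": [1, 1, 0, 0],
    "cache":  [1, 0, 1, 0],
    "bus":    [0, 1, 1, 0],
}


def encode(d):
    """Normal encoding: data followed by its check bits."""
    return list(d) + check_bits(d)


def mark_uncorrectable(data, component):
    """Replace the check bits so the syndrome equals the component's marker."""
    marked = [c ^ s for c, s in zip(check_bits(data), SIGNATURES[component])]
    return list(data) + marked


def syndrome(word):
    """Recompute the check bits from the data and compare with the stored ones."""
    data, stored = word[:4], word[4:]
    return [a ^ b for a, b in zip(check_bits(data), stored)]


def locate(word):
    """Return the component whose marker matches the syndrome, else None."""
    s = syndrome(word)
    for name, sig in SIGNATURES.items():
        if s == sig:
            return name
    return None


# A component that detects an uncorrectable error tags the word before sending it on.
tagged = mark_uncorrectable([1, 0, 1, 1], "cache")
origin = locate(tagged)
```

Here `origin` is `"cache"`: any later checker that computes the syndrome sees both that the word is uncorrectable and which component flagged it, while `locate(encode([1, 0, 1, 1]))` returns `None` for a clean word. This toy version does not address further errors striking an already tagged word.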
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.