1. Field of the Invention
The present invention relates to the field of error detection and correction.
2. Prior Art
Error detection and correction (ECC) is commonly used in digital systems in which the chance of an occasional error is substantial, and in high reliability systems in which immunity to even fairly rare errors is desired. Systems of the first kind include communication systems, such as those operating over the air or over phone lines, in which noise or temporary loss of signal may cause one or more errors in the data received. Systems of the second kind include fault tolerant computer systems in which, even though the likelihood of an error is already low, it is desired to correct the most likely errors and to detect the next most likely errors, so that the system will normally continue operating without error even when errors in fact occur, and can flag at least some level of uncorrectable error.
Various types of error detection and correction codes are well known in the prior art, and accordingly the existence and nature of such codes need not be described in detail herein. In general, such codes provide for the appending of a number m of ECC bits (the ECC word) to each n-bit dataword. The number of ECC bits m required depends upon the size of the dataword, the number and nature of the errors to be detected and corrected, the number and nature of additional errors, if any, which are to be detected even though they cannot be corrected, and the specific code itself.
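As an illustration of how m grows with n, the following minimal sketch computes the smallest number of check bits for a Hamming-style single error correcting code, using the well known bound 2^m >= n + m + 1, and adds one overall parity bit for double error detection. The function name min_check_bits is hypothetical, and the bound applies only to this family of codes; other codes and other error models require different counts.

```python
def min_check_bits(n: int, double_error_detect: bool = False) -> int:
    """Smallest m for a Hamming-style SEC code on an n-bit dataword.

    Assumption: the code must distinguish "no error" from a single error in
    any of the n + m codeword bits, so 2**m >= n + m + 1. One extra overall
    parity bit is added when double error detection is also wanted.
    """
    m = 1
    while 2 ** m < n + m + 1:
        m += 1
    return m + 1 if double_error_detect else m


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, min_check_bits(n), min_check_bits(n, double_error_detect=True))
```

For a 64-bit dataword this gives 7 check bits for single error correction and 8 when double error detection is added, consistent with a 72-bit codeword.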
References describing numerous ECC codes include "Error Correcting Codes, Second Edition", W. Peterson et al., printed by The Massachusetts Institute of Technology (1972), "Error-Correction Coding for Digital Communications", G. Clark et al., printed by Plenum Press (1981) and "A Class of Odd-Weight-Column SEC-DED-SbED Codes for Memory System Applications", S. Kaneda, IEEE Transactions on Computers, Vol. C-33, No. 8 (1984).
The last of these publications describes what are commonly referred to as SEC-DED-S4ED ECC Codes, which Sun Microsystems, assignee of the present invention, has previously used for error detection and correction in main memory systems. Such codes are characterized by single error correction (SEC), meaning that a single error occurring anywhere in the codeword (the combination of the dataword and the ECC word) may be corrected; double error detection (DED), meaning that any two errors in the codeword may be detected even if they cannot be corrected; and single 4-bit nibble error detection (S4ED), meaning that any errors occurring in a single nibble (a 4-bit sequence bounded by predefined nibble boundaries) may be detected. This last property allows up to four adjacent errors to be detected, provided they occur within the same nibble.
In the prior art system referred to above, the dataword is 64 bits long, with the associated SEC-DED-S4ED Code being 8 bits long, providing an entire codeword (data plus ECC) of 72 bits. Integrated memory circuit devices in the past have most commonly been organized with a 4 bit wide output, with 72 bit wide SIMMs being readily commercially available using eighteen 4 bit wide dynamic random access memories (DRAMs). In such a system, the ability to detect errors in any nibble of each 72 bit codeword provides the ability to detect the failure of any one of the eighteen memory devices, simply by having each 4 bit memory device output comprise a nibble in accordance with the SEC-DED-S4ED Code definition. Consequently, failures confined to a single memory device, such as a total device failure, a single line failure, an addressing failure, etc. may be detected. Thus, the most likely errors, although uncommon, can be either corrected or at least detected. Specifically, the occasional random errors (commonly referred to as soft errors because of their generally non-recurring nature) will be corrected. Since such an error occurs in only 1 in N memory operations, where N is normally a very large number, the likelihood of two such errors occurring at the same time will be 1 in N², a very unlikely occurrence which will still be detected. The likelihood of three random errors occurring at the same time will be 1 in N³. While three such errors are guaranteed to be detected only if they occur in the same nibble, the likelihood of three random errors occurring at the same time in different nibbles in normal DRAM devices is quite remote. The possibility of a device failure, however, is sufficient to make the detection of such failures highly desirable. Such failures are normally hard failures (a permanent device fault or failure), making their detection particularly important.
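The relationship between memory devices and nibbles can be sketched as follows. The mapping assumed here, in which codeword bit i is supplied by device i // 4, is only one possible assignment chosen for illustration; the text requires only that each device's 4 bit output coincide with one nibble of the code. All names are hypothetical.

```python
CODEWORD_BITS = 72
DEVICE_WIDTH = 4
NUM_DEVICES = CODEWORD_BITS // DEVICE_WIDTH  # eighteen 4 bit wide DRAMs


def device_failure_mask(device: int, bad_pins: int = 0b1111) -> int:
    """72-bit error mask produced when `device` corrupts the given pins.

    Assumption: device d drives codeword bits 4*d through 4*d + 3.
    """
    return (bad_pins & 0b1111) << (device * DEVICE_WIDTH)


def nibbles_touched(error_mask: int) -> set[int]:
    """Set of nibble indices containing at least one erroneous bit."""
    return {
        bit // DEVICE_WIDTH
        for bit in range(CODEWORD_BITS)
        if (error_mask >> bit) & 1
    }


if __name__ == "__main__":
    # Whatever the failure pattern, one device disturbs exactly one nibble,
    # which is the case the S4ED property is defined to detect.
    for device in range(NUM_DEVICES):
        for pins in range(1, 16):
            assert nibbles_touched(device_failure_mask(device, pins)) == {device}
    print("every single-device failure is confined to one nibble")
```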
More recently, DRAM devices organized with 8 bit wide outputs and 16 bit wide outputs have become common. An SEC-DED-S4ED Code will not normally detect device failures in devices having 8 bit or 16 bit outputs when such devices are used to provide the corresponding 8 or 16 bits of the 72 bit codeword, because such a failure is not confined to a single nibble. However, in the prior art system, the main memory is organized so that on each read operation, four 72 bit codewords are read from main memory simultaneously. DRAMs having a 4 bit wide output are used, with 1 bit of each 4 bit output being associated with a respective one of the four codewords read simultaneously. In this manner, when a DRAM fails (and this is the only error occurring at that time), errors will appear in only a single bit of each of the four codewords, making the DRAM failure correctable by the SEC-DED-S4ED Code of each of the four codewords.
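The interleaving described above can be sketched as a routing function from device pins to codeword bits. The specific routing assumed here, in which pin k of device d supplies bit d of codeword k, is hypothetical; the text states only that each of the four output bits of a DRAM belongs to a different codeword.

```python
NUM_CODEWORDS = 4    # codewords read simultaneously
CODEWORD_BITS = 72   # bits per codeword, hence 72 devices of 4 bits each
DEVICE_WIDTH = 4


def route(device: int, pin: int) -> tuple[int, int]:
    """Map a device output pin to (codeword index, bit position).

    Assumption: pin k of device d carries bit d of codeword k.
    """
    return pin, device


def errors_per_codeword(failed_device: int, bad_pins: int = 0b1111) -> list[int]:
    """Count erroneous bits landing in each codeword when one device fails."""
    counts = [0] * NUM_CODEWORDS
    for pin in range(DEVICE_WIDTH):
        if (bad_pins >> pin) & 1:
            codeword, _bit = route(failed_device, pin)
            counts[codeword] += 1
    return counts


if __name__ == "__main__":
    # Even a total failure of one device puts at most one error in each
    # codeword, which single error correction can repair.
    for device in range(CODEWORD_BITS):
        assert max(errors_per_codeword(device)) <= 1
    print(errors_per_codeword(failed_device=5))
```

Under this arrangement a device failure is corrected rather than merely detected, at the cost of always reading four codewords together.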