The present invention is generally directed to fault tolerant computer memory systems. More particularly, the present invention is directed to computer memory systems which employ both chip level and system level error correction coding schemes. Even more particularly, the present invention relates to memory chips having on-chip error correction capabilities and error correction disabling means to allow the reproduction of hard errors, particularly in those situations in which the reproducibility of these errors is important for system level error recovery procedures.
As semiconductor memory chips are developed with smaller and smaller feature sizes and a corresponding increase in circuit packaging density, additional error correction methods, such as on-chip error correction, become more and more important. In general, memory errors occurring on a chip fall into two distinct categories: hard errors and soft errors. Soft errors are typically transient events, such as those induced by background level alpha particle radiation or caused by parametric process sensitivities that create "weak cells". Weak cells are those that fail upon application of unique voltages or data patterns, or are otherwise sensitive to noise, printed image size or image tracking. With increasing chip densities, soft errors become more frequent. Thus, increasing chip density dictates a greater need for on-chip error correction capabilities, especially for soft errors.
In addition to the occurrence of soft errors, which can usually be corrected by error correction coding circuitry, there is also the possibility of hard errors. Hard errors often arise out of imperfect manufacturing conditions, including device contamination. With increasing memory densities, perfection in chip manufacture is very difficult to achieve. Thus, hard errors may be present in addition to soft errors. Hard errors do, however, have the seemingly paradoxical benefit of generally being repeatable, and it is the reproducibility of such errors which provides a mechanism for their correction (see below). One of the common forms of hard error occurring in a memory system or chip is the "stuck at" fault, in which a memory location continually produces a zero or a one output in one or more bit positions, irrespective of the data written to the memory cell.
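By way of illustration only, the data dependence and repeatability of a "stuck at" fault can be sketched in a few lines of Python. The `StuckAtCell` class and its method names are hypothetical; they model a single one-bit cell and do not represent any particular memory device.

```python
class StuckAtCell:
    """Hypothetical one-bit memory cell whose output is stuck at a fixed value."""

    def __init__(self, stuck_value):
        self.stuck_value = stuck_value  # value the cell always returns
        self.written = 0                # value the system last tried to store

    def write(self, bit):
        self.written = bit

    def read(self):
        # The output ignores what was written: this is the "stuck at" fault.
        return self.stuck_value

    def in_error(self):
        # An error is visible only when the written bit differs from the stuck value.
        return self.read() != self.written


cell = StuckAtCell(stuck_value=0)
cell.write(0)
hidden = cell.in_error()                       # writing the stuck value hides the fault
cell.write(1)
repeats = [cell.in_error() for _ in range(3)]  # the error recurs on every read
```

In this sketch the fault is invisible while the cell is asked to hold its stuck value, and it reappears identically on every read once the opposite value is written, which is the reproducibility referred to above.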
While there are many different error correction codes that are applicable and available for use in conjunction with memory systems, one of the most popular classes of codes employed for this purpose comprises codes with a minimum distance of four between code words. Such codes are capable of single error correction and double error detection (SEC/DED). These codes are well known and easily implemented and have a proven track record of reliability and ease of manufacture, particularly in terms of simplified circuitry and minimum consumption of chip "real estate". Clearly, single errors, whether hard or soft in nature, pose no problem for such codes. In addition, such codes can detect the presence of double errors, of either the hard or soft variety, but cannot generally correct them. In the event of two soft errors, it does not appear that correction is generally possible using such codes and decoding techniques. However, the presence of two hard errors, or of one hard and one soft error, does lend itself to the utilization of the complement/recomplement algorithm for double error correction. This algorithm, also referred to as the double complement algorithm, is described, for example, in an article by C. L. Chen and M. Y. Hsiao, "Error-Correcting Codes for Semiconductor Memory Applications, a State-of-the-Art Review", IBM Journal of Research and Development, pp. 124-134, March 1984. The algorithm takes advantage of the fact that hard errors are in general reproducible, which makes it possible to identify the bit positions that are in error. With this knowledge, double error correction can in fact be carried out. It is thus seen that the reproducibility of hard errors renders it possible to improve the reliability of information storage systems which are subject to hard-hard errors or hard-soft errors without increasing code word length.
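By way of illustration only, the complement/recomplement procedure can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the cited article: a small extended Hamming (8,4) code stands in for the distance-four SEC/DED code, the `FaultyWord` class is a hypothetical model of a memory word with stuck cells, and every stuck cell in the word is assumed to be in error at the time of the read.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions holding the 4 data bits


def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC/DED) code word."""
    bits = [0] * 8              # index 0: overall parity; 1..7: Hamming positions
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):         # each check bit covers positions with bit p set
        bits[p] = sum(bits[j] for j in range(1, 8) if j & p) % 2
    bits[0] = sum(bits) % 2     # makes the overall parity even
    return bits


def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    bits = list(bits)
    syndrome = 0
    for j in range(1, 8):
        if bits[j]:
            syndrome ^= j       # XOR of set positions is 0 for a valid word
    odd_parity = sum(bits) % 2
    if syndrome and not odd_parity:
        return None, "double"   # two errors: detectable but not correctable
    status = "ok"
    if odd_parity:
        bits[syndrome] ^= 1     # single error; syndrome 0 means the parity bit
        status = "corrected"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


class FaultyWord:
    """Hypothetical 8-cell memory word; `stuck` maps position -> stuck value."""

    def __init__(self, stuck):
        self.stuck = dict(stuck)
        self.cells = [0] * 8

    def write(self, bits):
        self.cells = list(bits)

    def read(self):
        return [self.stuck.get(i, b) for i, b in enumerate(self.cells)]


def complement_recomplement(word):
    """Retry a double error by writing back and re-reading the complement."""
    first = word.read()
    word.write([b ^ 1 for b in first])        # store the complement
    second = word.read()                      # stuck cells do not change
    stuck_positions = [i for i in range(8) if first[i] == second[i]]
    recomplemented = [b ^ 1 for b in second]  # wrong stuck bits are now right
    return decode(recomplemented), stuck_positions


word = FaultyWord(stuck={3: 0, 6: 1})         # two stuck cells, both in error
word.write(encode(0b1011))
before = decode(word.read())                  # double error, not correctable
after, located = complement_recomplement(word)
```

In this sketch a hard-hard error is removed entirely by the two complement operations, while a hard-soft error is reduced to a single soft error which the SEC/DED decoder can then correct. Positions at which the two reads agree, instead of being complementary, are the ones identified as stuck.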
Thus, aspects of memory chip design which tend to defeat hard error reproducibility also present barriers to system level double error correction, especially in systems which are designed around existing single error correction and double error detection codes and circuitry.
Memory architecture itself also plays a role in error correction considerations. In particular, it is often desirable to access a double word (64 bits) of memory data wherein each bit of the double word is supplied from a separate memory chip. This memory architecture is useful in that it can provide reliability and speed advantages. Error correction coding methods are also applied to the double word of data. This is referred to herein as system level error correction (and detection). It is at this level that the complement/recomplement algorithm is employed to correct hard-hard and hard-soft errors, that is, double errors in which at least one of the errors is hard. System level error correction coding also means that a certain number of memory chips are devoted solely to the storage of redundant coding information, typically of the parity or check sum variety.
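By way of illustration only, the number of chips devoted to redundant information can be estimated with the following Python sketch. It assumes a Hamming-type SEC/DED code with one overall parity bit and one bit per chip; the function name `secded_check_bits` is hypothetical, and the particular code used at the system level is not specified here.

```python
def secded_check_bits(data_bits):
    """Smallest number of check bits for a Hamming-based SEC/DED code."""
    r = 1
    # Single error correction needs 2**r >= data_bits + r + 1 distinct syndromes.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double error detection


check = secded_check_bits(64)   # check bits for a 64-bit double word
chips = 64 + check              # one chip per bit of the system-level code word
```

Under these assumptions a 64-bit double word requires 8 check bits, so that 72 chips would be accessed in parallel, 8 of which hold only redundant information.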
Accordingly, it is seen that it is desirable to construct memory systems which employ on-chip error correction and detection capabilities as a result of high circuit packaging densities. SEC/DED codes can correct only a single bit in error in each of their data words. For this reason, it is necessary to prevent any and all bit correction upon detection of a multiple error. With data correction inhibited, multiple errors cannot cause the SEC/DED system to erroneously alter a good data bit. The multiple error condition is then cleared during "write back" (i.e., the operation of transferring the on-chip ECC word with its appropriate check bits back into the DRAM cells) through the on-chip ECC system, since valid check bits are generated from the unaltered data word. In this system, damage to the data word integrity is limited to the original multiple errors. Although these errors can no longer be detected, the ECC system cannot cause degeneration of the data word during subsequent accesses.
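By way of illustration only, the effect of inhibiting correction and then writing back regenerated check bits can be sketched in Python as follows. The sketch assumes a small extended Hamming (8,4) code in place of the actual on-chip code, a hypothetical `OnChipEccWord` class, and faulty cells located in data positions; none of these details are taken from a particular chip design.

```python
DATA_POS = (3, 5, 6, 7)          # data cells; cells 0, 1, 2, 4 hold check bits


def with_check_bits(data):
    """Build an 8-bit extended Hamming word from a list of 4 data bits."""
    word = [0] * 8
    for bit, pos in zip(data, DATA_POS):
        word[pos] = bit
    for p in (1, 2, 4):
        word[p] = sum(word[j] for j in range(1, 8) if j & p) % 2
    word[0] = sum(word) % 2      # overall parity bit
    return word


def classify(word):
    """Return ('none' | 'single' | 'multiple', syndrome)."""
    syndrome = 0
    for j in range(1, 8):
        if word[j]:
            syndrome ^= j
    odd = sum(word) % 2
    if odd:
        return "single", syndrome
    return ("multiple" if syndrome else "none"), syndrome


class OnChipEccWord:
    """Hypothetical on-chip ECC word backed by cells that may be stuck."""

    def __init__(self, data, stuck):
        self.stuck = dict(stuck)             # position -> stuck value
        self.cells = with_check_bits(data)

    def _raw(self):
        return [self.stuck.get(i, b) for i, b in enumerate(self.cells)]

    def access(self):
        """Read through the ECC logic, then write back; return (data, kind)."""
        word = self._raw()
        kind, syndrome = classify(word)
        if kind == "single":
            word[syndrome] ^= 1              # normal single-bit correction
        # On a multiple error, correction is inhibited: data bits pass unaltered.
        data = [word[pos] for pos in DATA_POS]
        self.cells = with_check_bits(data)   # write back with fresh check bits
        return data, kind


chip_word = OnChipEccWord(data=[1, 1, 0, 1], stuck={3: 0, 6: 1})
first = chip_word.access()     # multiple error detected, data left unaltered
second = chip_word.access()    # same bad data, but no error is reported
```

In this sketch the first access reports a multiple error and leaves the data untouched, while the second access returns the same erroneous data with no error indication, because the check bits written back are valid for the data as read.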
The result of using this method is that all errors at the chip level appear soft. Detection of bad memory cells in manufacturing test is effectively done with pattern testing by comparing expected data with the entire ECC word. The bits in error are easily noted and the quality of hardware under test is easily evaluated. But in actual memory system operations, the total ECC word is not read out of the memory chip. Moreover, the number of bits that are typically read out is small. This greatly increases the probability of missing the bits in error after a multiple error in the chip data word has occurred. Such uncorrectable errors at the system level tend to cause major system failures. Upon occurrence of such an error, subsequent memory operations generally cease. At the same time, it is also seen that it is desirable to employ system level error correction and detection circuitry to increase memory reliability. It is this situation which produces the problem which is solved by the present invention. In particular, at the system level it is desirable to be able to employ the complement/recomplement algorithm to increase overall memory system reliability, particularly through correction of double errors which would not otherwise be corrected. However, the complement/recomplement algorithm depends upon the ability to reproduce hard errors, and the on-chip error correction capability can actually mask the presence of hard errors associated with a given chip. A more detailed example of this phenomenon is described below. Accordingly, the present invention is provided to solve the antagonism that can exist between chip level and system level error correction systems.