In recent years, continuing cost and performance improvements in solid state memory devices, such as charge coupled devices, as well as metal oxide semiconductor and bipolar random access memory devices, have made these devices increasingly popular as means for storing large quantities of digital data, to some extent replacing magnetic storage means. However, these components are susceptible to failure and have not yet reached the degree of reliability required of large memory systems. Fortunately, the organization of these devices permits relatively low cost error detection and correction techniques which can, in many cases, be used to overcome this shortcoming.
The specifics of organization referred to above which enable the relatively inexpensive implementation of these error detection and correction schemes are that the memory be organized in an orthogonal fashion, whereby a quantity of words of a given length (in a preferred embodiment of the present invention the word length is nine 8-bit bytes, i.e., 72 bits) are arranged to form records which may be on the order of 500 to 2000 words long. Thus a record may be 72 bits "high" by 1000 bits "wide". This orthogonality permits the digital data to be stored (hereinafter "information" data) to be examined at regular intervals, so that additional error-correction data can be generated during the writing operation and stored along with the information data; this error-correction data can later be used to determine whether or not the information data has been accurately read.
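The orthogonal layout can be pictured with a small Python sketch that treats each 72-bit word as one column of the record and each bit position as one row. The record length of 1000 words, the random fill, and the helper names `make_record` and `bit_row` are illustrative assumptions, not part of the described embodiment.

```python
import random

BYTES_PER_WORD = 9
WORD_BITS = BYTES_PER_WORD * 8   # 72 bits: the record's "height"
RECORD_WORDS = 1000              # example record length: the record's "width"


def make_record(n_words: int, seed: int = 0) -> list[int]:
    """Hypothetical helper: fill a record with random 72-bit words (columns)."""
    rng = random.Random(seed)
    return [rng.getrandbits(WORD_BITS) for _ in range(n_words)]


def bit_row(record: list[int], position: int) -> list[int]:
    """Bit `position` of every word, i.e. one horizontal row of the record."""
    return [(word >> position) & 1 for word in record]


record = make_record(RECORD_WORDS)
total_bits = len(record) * WORD_BITS   # 72 x 1000 = 72,000 bits
```

Each column is the fixed-length unit over which error-correction data would be generated during writing, while each row gathers the same bit position from every word in the record.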
Another distinct consideration is that two types of errors are common to these solid state memory devices. In particular, there are "hard" errors, wherein a single semiconductor device fails partially or completely, but permanently, and "soft" errors, which are temporary, may occur anywhere within a given memory device, and may be caused by a variety of sources. Given these two types of errors, the memory system designer must devise an error detection and correction method which will correct both types of errors in the vast majority of cases while requiring no more storage space for error-correction data than is necessary.
As discussed in the article "Semiconductor Memory Reliability with Error Detecting and Correcting Codes" by Levine and Meyers, Computer, October 1976, page 43, the type of error correction coding most suitable for main memory is based on the Hamming single error correcting, double error detecting (SEC-DED) codes. This code involves adding redundancy to a given data field (in the present embodiment, a word). As will be explained below, a number of check bits are generated at the time data is written into memory and are stored along with the information data to which they correspond. Upon reading the data, the same number of check bits are again generated, according to the same algorithm originally used to generate the first group of check bits, and the thus-generated check bits are compared with the stored check bits. If they are identical, it is presumed that no errors have occurred in the read operation or in storage. If the two groups of check bits differ in a single bit, that check bit is itself in error. If the two groups of check bits differ by one of a number of predetermined combinations of bits, the combination can be decoded to identify the location of the information bit in error. This bit can then be corrected simply by inverting it, since in binary data the only possible error is a zero changed into a one or vice versa; hence, all that is required to correct an error is to identify its location. While this scheme is very useful, it has a significant limitation: it can identify (and hence correct) only a single bit error. If a double bit error has occurred, the scheme can detect that fact but cannot correct it.
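The write-time generation and read-time comparison of check bits can be sketched as a Hamming-style SEC-DED code in Python. This is a minimal sketch assuming the 72-bit word is split into 64 information bits, 7 Hamming check bits and 1 overall parity bit; the source does not state this split, and the check-bit assignment (`CODES`) and the function names are hypothetical, not those of Levine and Meyers or of the present embodiment.

```python
DATA_BITS = 64     # assumed: 64 information bits per 72-bit word ...
HAMMING_BITS = 7   # ... plus 7 Hamming check bits and 1 overall parity bit

# Hypothetical check-bit assignment: each information bit gets a distinct
# 7-bit code that is neither zero nor a power of two, so a syndrome with a
# single bit set can only point at a check bit, never at an information bit.
CODES = [v for v in range(3, 1 << HAMMING_BITS) if v & (v - 1)][:DATA_BITS]


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def check_bits(data: int) -> int:
    """Check bit j is the XOR of the information bits whose code has bit j set."""
    check = 0
    for i, code in enumerate(CODES):
        if (data >> i) & 1:
            check ^= code
    return check


def encode(data: int) -> tuple[int, int]:
    """Bits written alongside the word: (Hamming check bits, overall parity bit)."""
    check = check_bits(data)
    return check, parity(data) ^ parity(check)


def decode(data: int, check: int, overall: int) -> tuple[str, int]:
    """Regenerate check bits on read, compare with the stored ones, correct if possible."""
    syndrome = check ^ check_bits(data)            # bits where stored and regenerated differ
    odd = parity(data) ^ parity(check) ^ overall   # 1 if an odd number of bits flipped
    if syndrome == 0 and not odd:
        return "no error", data
    if not odd:
        return "double error detected", data       # detectable but not correctable
    if (syndrome & (syndrome - 1)) == 0:           # zero or exactly one bit set
        return "check bit corrected", data         # information bits are intact
    if syndrome in CODES:                          # decode location, then invert that bit
        return "data bit corrected", data ^ (1 << CODES.index(syndrome))
    return "uncorrectable", data
```

For example, `decode(word ^ (1 << 17), *encode(word))` returns `("data bit corrected", word)`, while flipping two information bits yields `"double error detected"` with the word left uncorrected. The overall parity bit is what separates a single error (odd) from a double error (even) when the syndrome is nonzero.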
One known improvement on this type of error correction scheme is to save the most recently identified error location. If the bits comprising a particular word are spread out over an equal number of individual solid state devices, and if the error is due to a permanent or "hard" failure of one of those devices, the bit at this location is likely to be in error again and may be inverted. If that bit was in fact in error, a second error can then be detected and corrected according to the ordinary scheme. The difficulty with this approach is that the Hamming error detection means cannot tell whether two errors have been corrected or whether additional errors have been introduced by the inversion since, as discussed above, the scheme is only capable of detecting 2-bit errors and correcting a single bit error. Beyond a single error, the scheme cannot tell how many errors have occurred: any number of errors over one is, in the eyes of the error detection scheme discussed above, indistinguishable from any other; three errors cannot be told from nine.
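The saved-location improvement and its weakness can be illustrated with a self-contained Python sketch. It repeats a small Hamming-style SEC-DED decoder (assuming 64 information bits, 7 check bits and 1 overall parity bit, with a hypothetical check-bit assignment) and wraps it in a hypothetical `SavedLocationDecoder` class that remembers the last corrected information-bit position and, on a double error, inverts that bit and decodes again.

```python
DATA_BITS, HAMMING_BITS = 64, 7   # assumed split of a 72-bit word
# Hypothetical check-bit assignment: distinct codes, none zero or a power of two.
CODES = [v for v in range(3, 1 << HAMMING_BITS) if v & (v - 1)][:DATA_BITS]


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def check_bits(data: int) -> int:
    check = 0
    for i, code in enumerate(CODES):
        if (data >> i) & 1:
            check ^= code
    return check


def encode(data: int) -> tuple[int, int]:
    check = check_bits(data)
    return check, parity(data) ^ parity(check)


def decode(data: int, check: int, overall: int) -> tuple[str, int]:
    """Ordinary SEC-DED decode: correct one error, detect two."""
    syndrome = check ^ check_bits(data)
    odd = parity(data) ^ parity(check) ^ overall
    if syndrome == 0 and not odd:
        return "no error", data
    if not odd:
        return "double error detected", data
    if (syndrome & (syndrome - 1)) == 0:
        return "check bit corrected", data
    if syndrome in CODES:
        return "data bit corrected", data ^ (1 << CODES.index(syndrome))
    return "uncorrectable", data


class SavedLocationDecoder:
    """Hypothetical wrapper: remembers the most recently corrected information
    bit, on the assumption that a hard device failure will recur there."""

    def __init__(self) -> None:
        self.last_bit = None   # position of the last corrected information bit

    def read(self, data: int, check: int, overall: int) -> tuple[str, int]:
        status, fixed = decode(data, check, overall)
        if status == "data bit corrected":
            self.last_bit = (fixed ^ data).bit_length() - 1
        elif status == "double error detected" and self.last_bit is not None:
            # Invert the remembered bit and decode again. If that bit really was
            # in error, one error remains and the ordinary scheme corrects it.
            # If it was not, the inversion adds a third error, which SEC-DED
            # cannot distinguish from a single error.
            retry_status, retry = decode(data ^ (1 << self.last_bit), check, overall)
            if retry_status in ("data bit corrected", "check bit corrected"):
                return "corrected using saved location", retry
        return status, fixed
```

With a recurring failure at bit 2 plus a soft error at bit 5, the retry recovers the original word. With soft errors at bits 0 and 1 while bit 2 is healthy, the inversion creates a third error that the decoder mistakes for a single parity-bit error, so it returns a wrong word while reporting success, which is exactly the difficulty described above.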