Semiconductor storage units made by large scale integrated circuit techniques have proven to be cost-effective for certain applications of storing digital information. Most storage units comprise a plurality of similar storage devices or bit planes, each of which is organized to contain as many storage cells or bits as feasible in order to reduce per bit costs, and also to contain addressing, read and write circuits in order to minimize the number of connections to each storage device. In many designs, this has resulted in an optimum storage device or bit plane that is organized as N words of 1 bit each, where N is some power of two, typically 256, 1024 or 4096. Because of the 1 bit organization of the storage device, single bit error correction as described by Hamming in the publication "Error Detecting and Correcting Codes," R. W. Hamming, The Bell System Technical Journal, Vol. XXIX, No. 2, April 1950, pp. 147-160, has proven quite effective in tolerating the partial or complete failure of a single storage cell or bit in a given word, i.e., a single bit error, without causing loss of data read out from the storage unit, the word being of a size equal to the word capacity of the storage unit. This increases the effective mean-time-between-failures (MTBF) of the storage unit.
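By way of illustration only, single bit error correction of the kind described by Hamming can be sketched in Python with a (7,4) code. The 7-bit word length, the function names and the bit ordering are assumptions chosen for brevity; a practical storage unit would use a code sized to its own word length.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit word.

    Bit positions are numbered 1..7; check bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected word, error position), position 0 meaning no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome read as a binary number
    if position:
        w[position - 1] ^= 1  # flip the single failing bit back
    return w, position
```

If each bit position of a word is supplied by a different 1 bit storage device, a nonzero error position also indicates which device supplied the failing bit.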
Because the storage devices are quite complex, and because many are used in a semiconductor storage unit, they usually represent the predominant source of component failure in a storage unit. Consequently, it is common practice to employ some form of single bit error correction along the lines described by Hamming. While single bit error correction allows storage cell failures to be tolerated, as more cells fail the statistical chance of finding two failures in the same word, i.e., a double bit error, increases. Since two failing storage cells in the same word cannot be corrected, it would be desirable to replace all defective storage devices before this occurs, such as at a time when the storage unit is not in use but is assigned to routine preventative maintenance.
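The growth in the chance of a double bit error can be illustrated with a simple model. The sketch below is not taken from the text above; it assumes a storage unit divided into banks, each word of a bank drawing one bit from each device of that bank, with complete device failures falling uniformly at random. The function name and parameters are hypothetical.

```python
from math import comb


def p_two_failures_share_bank(banks, devices_per_bank, failed):
    """Probability that at least two of `failed` devices lie in the same bank.

    Assumed model: failed devices are a uniformly random subset of all
    banks * devices_per_bank devices. Two failed devices in one bank
    expose every word of that bank to a double bit error.
    """
    total = banks * devices_per_bank
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and the device count")
    # Ways to place every failed device in a different bank.
    all_distinct = comb(banks, failed) * devices_per_bank ** failed
    return 1 - all_distinct / comb(total, failed)
```

Under these assumptions the probability is zero for a single failed device and rises with each further failure, which is the trade-off weighed when replacement is deferred.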
While it would be possible to replace each defective storage device shortly after it failed, this normally would not be necessary. It would be more economical to defer replacement until several storage devices were defective, thereby achieving a better balance between repair costs and the probability of getting a double failure in a given word. Deferring replacement in this way requires a record of which storage devices have failed. One technique for keeping such a record is to have the central processor to which the storage unit is connected log the errors as one of its many other tasks under its normal logic and program control. However, this use of processor time effectively slows down the processor for its intended purpose, since time must be allocated to log errors from the storage unit. The effect of this can be better understood when it is noted that a complete failure of a storage device in an often-used section of the storage unit may require a single bit error to be reported every storage cycle. Since the processor may need several storage cycles to log each error, a great loss of performance would result. One method which has been used to alleviate this is to sample only part of the errors, but this leaves the error log incomplete.
The novel procedure described herein alleviates the above problem by not reporting the same defective device every time it is read out. This procedure also has the advantage that no modifications need to be made to the logic of the central processor when a storage unit is replaced with one that embodies error correction features. This allows, for example, the inclusion of error correction in a storage unit and connection of it to an existing or in-use processor without any changes to the processor at installation time.
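The idea of not reporting the same defective device on every readout can be sketched as follows. This is an illustrative assumption, not the mechanism of the procedure itself, which is not detailed in this passage; the class name, method names and the use of an integer device identifier are all hypothetical.

```python
class ErrorReporter:
    """Reports each defective device once, however often it is read out."""

    def __init__(self):
        self._reported = set()  # devices already reported
        self.log = []           # reports actually issued, in order

    def on_corrected_error(self, device_id):
        """Call when a single bit error is corrected on readout.

        Returns True if a report was issued, False if it was suppressed
        because this device has already been reported.
        """
        if device_id in self._reported:
            return False
        self._reported.add(device_id)
        self.log.append(device_id)
        return True

    def device_replaced(self, device_id):
        """Forget a device after replacement so later failures are reported."""
        self._reported.discard(device_id)
```

In this sketch a device that fails completely in an often-used section of the storage unit produces one report rather than one per storage cycle, while the log still names every defective device.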