Semiconductor storage units made by large scale integrated circuit techniques have proven to be cost-effective for certain applications of storing digital information. Most storage units comprise a plurality of similar storage devices or bit planes, each of which is organized to contain as many storage cells or bits as feasible in order to reduce per bit costs, and also to contain addressing and read and write circuits in order to minimize the number of connections to each storage device. In many designs, this has resulted in an optimum storage device or bit plane that is organized as M words of 1 bit each, where M is some power of two, typically 256, 1024 or 4096. Certain contemporary technologies produce devices of 2^14 or more bits. Because of the 1 bit organization of the storage device, single bit error correction as described by Hamming in the publication "Error Detecting and Error Correcting Codes," R. W. Hamming, The Bell System Technical Journal, Volume XXIX, April, 1950, No. 2, pp. 147-160, has proven quite effective in correcting the error of a single storage cell or bit in a given word, i.e., a single bit error, the word being of a size equal to the word capacity of the storage unit, without causing loss of data readout from the storage unit. This increases the effective mean-time-between-failure (MTBF) of the storage unit.
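The single bit error correction referred to above may be illustrated by a minimal sketch of a Hamming code with 4 data bits and 3 check bits. The 7 bit word size, the function names and the bit ordering are assumptions chosen for brevity and are not taken from any particular storage unit; a practical storage unit would use a much wider word. Each position of the code word may be thought of as residing in a different 1 bit storage device, so that a failing cell in one device disturbs at most one bit of the word.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit code word (positions 1..7).

    Illustrative layout (an assumption): positions 1, 2 and 4 hold
    check bits, positions 3, 5, 6 and 7 hold data bits.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (data_bits, syndrome) after correcting at most one bad bit.

    A syndrome of 0 means no error was detected; otherwise the syndrome
    is the 1-based position of the bit that was inverted back.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        w[syndrome - 1] ^= 1  # invert the failing bit
    return [w[2], w[4], w[5], w[6]], syndrome


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1  # simulate one failing storage cell (position 5)
    data, syndrome = hamming74_correct(stored)
    print(data, syndrome)
```

In this sketch the syndrome identifies which bit position, and hence which storage device, supplied the erroneous bit, while the corrected data is still delivered to the requester.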
These errors may be classified either as short-lived or long-lived and are designated "transient" (intermittent) or "solid" (permanent, hard), respectively. A transient error may, for example, be the result of a sudden fluctuation in the power supply or the result of a momentary presence of electric or magnetic noise in or near the system. A permanent error may, for example, result from the breakdown of a component such as a transistor or diode. A permanent or solid error is normally the symptom of a component failure, whereas a transient error by its nature may be the result of indeterminate and unrepeatable causes. For their purposes, maintenance personnel must respond to a solid failure with corrective action but are likely powerless to act upon transient errors.
Because the storage devices are quite complex, and because many are used in a semiconductor memory storage unit, they usually represent the predominant component failure in a storage unit. Consequently, it is common practice to employ some form of single bit error correction along the lines described in Hamming. While single bit error correction allows for tolerance of storage cell failures, as more of them fail, the statistical chance of finding two of them, i.e., a double bit error, in the same word increases. Since two failing storage cells in the same word cannot be corrected, it would be desirable to replace all defective storage devices before this occurred, such as at a time when the storage unit would not be in use but assigned to routine preventative maintenance.
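The growth of the double bit error probability with the number of failed storage cells may be illustrated by a simplified calculation. The sketch below assumes, purely for illustration, that each failure is a single defective cell in a different storage device and that the failing word addresses are independent and uniformly distributed over the M words of a device; neither the model nor the function name comes from the referenced publications. Under these assumptions the calculation is the familiar "birthday" computation of the chance that at least two failures share a word address.

```python
def prob_double_bit_error(failed_cells, words_per_device):
    """Probability that at least two failed cells fall in the same word.

    Assumes (for illustration only) one failed cell per storage device,
    with failing addresses independent and uniform over the device.
    """
    p_all_distinct = 1.0
    for i in range(failed_cells):
        # The (i+1)-th failure must avoid the i addresses already taken.
        p_all_distinct *= 1.0 - i / words_per_device
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        print(k, prob_double_bit_error(k, 4096))
```

The probability rises roughly with the square of the number of failed cells while it remains small, which is why deferring replacement until several devices have failed is tolerable only up to a point.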
While it would be possible to replace each defective storage device shortly after it failed, this normally would not be necessary. It would be more economical to defer replacement until several storage devices were defective, thereby achieving a better balance between repair costs and the probability of getting a double failure in a given word. One technique for keeping track of the defective devices is to have the central processor to which the storage unit is connected log the errors as one of its many other tasks under its normal logic and program control. However, this use of processor time effectively slows down the processor for its intended purpose, since time must be allocated to log errors from the storage unit. The effect of this can be better understood when it is noted that a complete failure of a storage device in an often-used section of the storage unit may require a single error to be reported every storage cycle. Since the processor may need several storage cycles to log the error, a great loss of performance would result. One method which has been used to alleviate this is to sample only part of the errors, but the resulting log is incomplete.
The present art uses a technique referred to as "error logging" disclosed by Petschauer in U.S. Pat. No. 3,999,051. The problem with the Petschauer approach is its inability to distinguish between transient and solid errors. As a result, operators are notified, and maintenance periods are scheduled, partly on the basis of transient errors, which are of little immediate operational concern.
The novel procedure described herein alleviates the above problem by distinguishing between solid and transient errors and reporting only those conditions wherein a solid error (and, therefore, a component failure) is present. This procedure also has the advantage that no modifications need to be made to the logic of the central processor when a storage unit is replaced with one that embodies error correction features. This allows, for example, the inclusion of error correction in a storage unit and connection of it to an existing or in-use processor without any changes to the processor at installation time.