a. Field of the Invention
The present invention relates to computer memory storage systems, and more particularly relates to storage systems employing scrubbing and sparing.
b. Related Art
Various arrangements have been suggested in the prior art which permit semiconductor memories to recover from defective data word bit positions caused by soft (transient) and/or hard (non-transient) errors. The data word, for example, may comprise 72 bit positions where 64 positions store data and 8 positions are employed for an error correcting check byte which, when processed by a suitable error correcting system associated with the memory, is capable of automatically correcting a single-bit error in any one of the bit positions of the word. Most systems also are capable of detecting multi-bit errors and are generally designed from a code standpoint so as not to miscorrect any of the good data bits.
The prior art also includes systems for correcting double bit errors. An article entitled "MULTIPLE ERROR CORRECTION" (IBM Technical Disclosure Bulletin, Vol. 13, No. 8, January, 1971, Pg. 2190) describes a circuit for automatically correcting multiple bit errors. When a double error is detected, the word fetched from memory is read into an error register and the complement of the fetched word is rewritten back into the original memory location. A fetch cycle is then executed on the complement of the fetched word. The word and its complement are compared in an Exclusive OR circuit that identifies the location of the failing bits. This information is utilized to complement the incorrect bits in the original fetched word. The information concerning the failing bits is also stored with the address position of the error. When an error is later detected and there is an address match with the address of the earlier error, the failing bits in the new error are corrected automatically. Another scheme, which corrects double bit errors by using an Error Correction Code (ECC) check syndrome in conjunction with a complement/recomplement type algorithm is described in an article entitled "MULTIPLE MEMORY ERROR CORRECTION" (IBM Technical Disclosure Bulletin, Vol. 24, No. 6, November, 1981, Pg. 2690).
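By way of illustration only, the complement-and-compare step described above may be sketched as follows. The class `StuckAtWord`, the function `locate_and_correct`, and the 8-bit word width are hypothetical and are not taken from the cited articles. The sketch further assumes that every stuck cell holds the wrong value in the fetched word, as would be the case when a detected double error is caused by two stuck cells.

```python
class StuckAtWord:
    """Hypothetical model of one memory word whose cells may be stuck at 0 or 1."""

    def __init__(self, width, stuck_at_zero=0, stuck_at_one=0):
        self.mask = (1 << width) - 1
        self.stuck_at_zero = stuck_at_zero  # bit mask of cells that always read 0
        self.stuck_at_one = stuck_at_one    # bit mask of cells that always read 1
        self.cells = 0

    def write(self, value):
        self.cells = value & self.mask

    def read(self):
        # Stuck cells override whatever value was written to them.
        return ((self.cells & ~self.stuck_at_zero) | self.stuck_at_one) & self.mask


def locate_and_correct(word):
    """Return (failing_bit_mask, corrected_word) using a complement rewrite."""
    fetched = word.read()
    word.write(~fetched & word.mask)   # rewrite the complement to the same location
    refetched = word.read()            # fetch the complement back
    # A healthy cell flips between the two fetches; a stuck cell reads the same
    # value both times, so the Exclusive OR is 0 exactly at the failing bits.
    failing = ~(fetched ^ refetched) & word.mask
    # Assumption: each failing bit is wrong in the fetched word, so complement it.
    corrected = fetched ^ failing
    return failing, corrected
```

For example, if the value `0b10110010` is written to a word whose bit 1 is stuck at 0 and whose bit 6 is stuck at 1, the fetch returns `0b11110000`, and the sketch identifies both failing bits and recovers the value originally written.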
In order to correct soft errors that tend to occur in the memory array between refresh cycles, many conventional systems implement a technique known as "scrubbing". During a scrubbing cycle, each memory location in an array is accessed sequentially and the data within is read. Typically, ECC logic checks each data word and corrects any single bit errors. The data is then restored to memory. If the single bit error was caused by a soft error, the restore operation replaces the bad data with corrected data, thereby removing the soft fail.
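A scrubbing cycle of this kind may be sketched, under stated assumptions, as follows. A Hamming(7,4) single-error-correcting code stands in for the larger ECC word of a real memory, and the names `syndrome`, `encode`, `decode` and `scrub` are hypothetical. Codeword position p (1 to 7) is held in bit p-1 of an integer.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # Hamming(7,4) data positions, 1-indexed
PARITY_POSITIONS = (1, 2, 4)     # parity positions are the powers of two


def syndrome(codeword):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            codeword |= 1 << (pos - 1)
    s = syndrome(codeword)
    for p in PARITY_POSITIONS:
        if s & p:                       # set the parity bits that zero the syndrome
            codeword |= 1 << (p - 1)
    return codeword


def decode(codeword):
    """Extract the 4 data bits from a codeword (no correction)."""
    return sum(((codeword >> (pos - 1)) & 1) << i
               for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Read every location in order, correct single-bit errors, restore the data."""
    corrected_count = 0
    for address, codeword in enumerate(memory):
        s = syndrome(codeword)
        if s:                           # nonzero syndrome names the failing position
            codeword ^= 1 << (s - 1)
            corrected_count += 1
        memory[address] = codeword      # restore (possibly corrected) data
    return corrected_count
```

In this sketch a soft error is removed by the restore step. A hard error would be corrected on each read but would reappear at the same location on every subsequent pass.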
The prior art has recognized that certain types of fault conditions in semiconductor memories are basically data dependent in that when a data bit is read out from the faulty position, it is always one binary value or the other. Such errors are commonly referred to as "hard" errors. A mechanism which operates during scrubbing to determine whether a single bit error is a soft error or a hard error is described in an article entitled "HARDWARE MECHANISM TO DETERMINE THE TYPE OF SINGLE BIT MEMORY ERROR" (IBM Technical Disclosure Bulletin, Vol. 32, No. 4B, September, 1989, Pg. 241).
Most single and double bit "hard" errors can be corrected using the same error correction techniques as are utilized for soft errors. Some hard errors are, however, uncorrectable. An uncorrectable error will only occur if a random error, hard or soft, occurs at some other bit position at the same time the first defective bit position contains a binary value that is different than the value originally written to that position. Where a bit position in a data word has a "hard error" the likelihood that an uncorrectable error will eventually occur is substantially increased. Since such a data word will always include at least a single bit error, the occurrence of any additional hard or soft errors may cause the data word to become uncorrectable.
To handle instances where a bit position has failed due to a hard error, some prior art systems have been provided with a capability known as "sparing". Sparing (also known as "bit-steering") refers to the replacement of an identified defective bit position by logically steering a bit from a replacement chip into the defective bit position, effectively replacing the defective position. For example, in U.S. Pat. No. 4,584,682 to Shah et al. an array substitution scheme is used to substitute a spare chip for a faulty chip when an uncorrectable error condition results from an alignment of two errors in bit positions accessed through the same decoder, while a bit permutation apparatus is used to misalign faulty bits when they occur in positions accessed through different decoders.
In an article entitled "DYNAMIC SPARING OF STORAGE MODULES" (IBM Technical Disclosure Bulletin, Vol. 29, No. 7, December, 1986, pp. 2828-2829) a method of dynamically sparing a storage module without system disruption is described. The method includes detection of a faulty storage module as well as its replacement with a spare module. The memory is organized or mapped such that each bit of a memory word is associated with a unique storage module. In its detection stage, the method of the above described article relies on the use of scrubbing (a conventional technique used to remove correctable soft errors from a storage subsystem). During scrubbing, the error correction code (ECC) logic generates a syndrome (i.e. a series of bits encoded to contain information about the correctness of the data word) for each word it reads and rewrites. During a given scrub pass the syndrome of any single bit error (SBE) occurrence is held. If the same syndrome occurs more than N times during that scrub pass, the bit indicated by the syndrome is identified for sparing. During the next scrub pass, the bit in question is stored back into both the old location and the spare. At the end of this pass, the spare bit is switched into use. This allows the system to run and use the storage in question with minimum impact.
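The detection stage described above, in which a bit is identified for sparing once its syndrome recurs more than N times in one scrub pass, may be sketched as follows. The function name `bits_to_spare`, the convention that a syndrome of 0 means no error, and the use of a counter that tracks every distinct syndrome are assumptions made for illustration; the cited article is not stated to operate in this exact manner.

```python
from collections import Counter


def bits_to_spare(pass_syndromes, n):
    """Return the syndromes seen more than n times during one scrub pass.

    pass_syndromes: the single-bit-error syndrome observed for each word
    scrubbed in the pass, with 0 meaning no error (an assumed convention).
    """
    counts = Counter(s for s in pass_syndromes if s != 0)
    # A syndrome that recurs more than n times suggests a hard fault in the
    # storage module holding that bit, so the bit is flagged for sparing.
    return sorted(s for s, count in counts.items() if count > n)
```

With N equal to 3, a pass that yields syndrome 5 four times and syndrome 3 once flags only the bit indicated by syndrome 5.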
While the above-described systems provide an increased degree of memory reliability, they leave a number of problems unresolved. For example, all of the above-described error correction methods fall short when more than two errors occur simultaneously in a single ECC data word. In cases where the data word has more than a double bit error, a correct result is not ensured. Further, in such cases, the erroneous bit locations cannot be identified from the ECC tree, thus limiting the capability of the sparing system to swap the failing bit positions in a timely manner. Thus there is a need for a sparing system that does not rely on ECC for its implementation and can identify permanent data errors in all bits of the ECC data word simultaneously.