1. Technical Field
The present invention relates generally to error correction codes, and in particular, to utilizing correctable error analysis to identify otherwise undetected multi-bit errors.
2. Description of the Related Art
Many hardware diagnostic tests for memory arrays or buses rely on hardware-generated error correction codes (ECCs) which detect and correct single-bit errors known as correctable errors (CEs). Such ECCs are often further enabled to detect, but not correct, multi-bit errors known as uncorrectable errors (UEs). A primary goal of ECC diagnostics testing is to identify the locations of UEs so that hardware containing UEs can be deconfigured.
Robust ECC testing procedures have long been recognized as a practical necessity for main storage on large scale computer systems such as the S/390 Parallel Enterprise Server systems available from IBM Corporation. S/390 and IBM are registered trademarks, and S/390 Parallel Enterprise Server is a trademark of IBM Corporation. Since the main storage on such large systems often serves as the central data repository accessed by disparate users throughout an enterprise, the criticality of preserving the integrity of the massive amount of data stored on such large systems is readily apparent.
Hardware-generated ECC results are generated and processed with respect to individual test patterns. Therefore, a UE will only be detected if a single test pattern applies, to each of the faulty bit locations, a logic level opposite the level at which that bit is stuck. A UE is easily detected if it comprises two bits that are stuck at the same logic level. In such cases, a uniform pattern of either all logic lows or all logic highs (e.g., 0x00000000 or 0xFFFFFFFF) will expose the UE. If, however, one of the faulty bits is stuck high and another of the faulty bits is stuck low, the pattern matching requirement for a successful detection pattern is much more exacting, since it requires that opposite level test pattern bits be simultaneously applied to each of the faulty bit locations. UE detection becomes even more difficult when the faulty bit locations are not persistently stuck at particular levels, but instead fail intermittently.
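The dependence of UE detection on the applied pattern can be illustrated with a minimal Python sketch. The sketch is not taken from any actual ECC hardware; the 32-bit word width, the function names `read_back` and `classify`, and the chosen bit positions 3 and 20 are assumptions made only for illustration. It also simplifies ECC behavior by treating exactly one mismatched bit as a CE and two or more mismatched bits as a UE.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word, matching the example patterns


def read_back(pattern, stuck_high=0, stuck_low=0):
    """Return the word read after writing `pattern` to a location
    whose bits in `stuck_high` always read 1 and whose bits in
    `stuck_low` always read 0."""
    return (pattern | stuck_high) & ~stuck_low & MASK


def classify(pattern, stuck_high=0, stuck_low=0):
    """Simplified ECC outcome for one pattern: no mismatched bits is
    'OK', one mismatched bit is 'CE', two or more is 'UE'."""
    mismatched = bin(pattern ^ read_back(pattern, stuck_high, stuck_low)).count("1")
    if mismatched == 0:
        return "OK"
    return "CE" if mismatched == 1 else "UE"


BIT3, BIT20 = 1 << 3, 1 << 20  # hypothetical faulty bit locations

# Both bits stuck high: the all-lows pattern exposes the UE.
same_level = [classify(p, stuck_high=BIT3 | BIT20) for p in (0x00000000, 0xFFFFFFFF)]

# One bit stuck high, one stuck low: each uniform pattern sees only a CE.
opposite_level = [classify(p, stuck_high=BIT3, stuck_low=BIT20)
                  for p in (0x00000000, 0xFFFFFFFF)]

print(same_level)      # ['UE', 'OK']
print(opposite_level)  # ['CE', 'CE']
```

In this sketch, the opposite-level fault is reported as a UE only by a pattern that holds a low at bit 3 and a high at bit 20 at the same time, which neither uniform pattern does.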
A known solution to testing for and detecting UEs having multiple logic levels is to utilize multiple test patterns containing variations of alternating high and low bits. For example, a common set of patterns may include: 0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555, 0xCCCCCCCC, 0x33333333, 0xF0F0F0F0, and 0x0F0F0F0F. The number and type of patterns are selected to achieve a desired coverage level for reliable UE detection.
While the use of multiple patterns improves the reliability of detecting UEs having bad bits stuck at multiple logic levels, several problems remain unresolved. For example, if bit locations bn and bm are spread sufficiently far apart and are stuck at opposite logic levels, many multi-pattern ECC tests will detect two single-bit errors rather than a multi-bit error. This occurs when the faulty bits stuck at opposite levels are farther apart than the cycle of repeating bits in each pattern. Conventional multi-pattern ECC testing also fails to adequately address the problem of intermittently occurring multiple-bit errors. For an intermittently occurring multiple-bit error, the multi-pattern testing sequence might detect fewer than all of the faulty bits per test pattern, so that one pattern detects a perceived CE and a different pattern detects another incorrectly perceived CE. For both the bit spread issue and the intermittent fault issue, increasing the number of patterns expands UE detection coverage, but also increases the costs associated with the extra test patterns.
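The bit spread issue can be illustrated with a short Python sketch that applies the eight example patterns listed above to a word having one bit stuck high and another stuck low. The function names `mismatched_bits` and `outcomes`, the 32-bit word width, and the chosen bit positions are assumptions made only for illustration, and the one-bit CE versus multi-bit UE classification is a simplification of actual ECC behavior.

```python
PATTERNS = [0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555,
            0xCCCCCCCC, 0x33333333, 0xF0F0F0F0, 0x0F0F0F0F]


def mismatched_bits(pattern, high_bit, low_bit):
    """Count bits that read back wrong when `high_bit` is stuck at 1
    and `low_bit` is stuck at 0 (bit 0 is the least significant)."""
    read = (pattern | (1 << high_bit)) & ~(1 << low_bit) & 0xFFFFFFFF
    return bin(pattern ^ read).count("1")


def outcomes(high_bit, low_bit):
    """Simplified per-pattern ECC outcome: 'OK', 'CE', or 'UE'."""
    labels = []
    for p in PATTERNS:
        n = mismatched_bits(p, high_bit, low_bit)
        labels.append("OK" if n == 0 else "CE" if n == 1 else "UE")
    return labels


near = outcomes(high_bit=0, low_bit=1)   # adjacent faulty bits
far = outcomes(high_bit=0, low_bit=16)   # faulty bits 16 positions apart

print(near)  # ['CE', 'CE', 'UE', 'OK', 'CE', 'CE', 'CE', 'CE']
print(far)   # ['CE', 'CE', 'CE', 'CE', 'CE', 'CE', 'CE', 'CE']
```

In the second case, every listed pattern repeats with a cycle of at most eight bits, so bits 0 and 16 always receive the same logic level. No pattern in the set can therefore apply opposite levels to the two locations, and each pattern reports only a single-bit error.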
It can therefore be appreciated that a need exists for a method, system, and computer program product that address problems relating to reliably and comprehensively detecting UEs with a limited test pattern cycle range. The present invention addresses this and other needs unresolved by the prior art.