1. Field of the Invention
The invention relates generally to high-density computer systems. The invention relates more specifically to a signature analysis method which assists in identifying the more probable ones of plural field replaceable units from which correctable errors arise.
1a. Cross Reference to Related Applications
The following copending U.S. patent application(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosure(s) is/are incorporated herein by reference:
(A) Ser. No. 07/670,289 entitled "SCANNABLE SYSTEM WITH ADDRESSABLE SCAN RESET GROUPS", by Robert Edwards et al, which was filed Mar. 15, 1991 [Attorney Docket No. AMDH7954].
(B) Ser. No. 07/813,891, entitled "IMPROVED METHOD AND APPARATUS FOR LOCATING SOURCE OF ERROR IN HIGH-SPEED SYNCHRONOUS SYSTEMS," by Christopher Y. Satterlee et al., filed Dec. 23, 1991 [Attorney Docket No. AMDH7952].
1b. Cross Reference to Related Patents
The following U.S. Pat. No.(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosures is/are incorporated herein by reference:
(A) 4,244,019, DATA PROCESSING SYSTEM INCLUDING A PROGRAM-EXECUTING SECONDARY SYSTEM CONTROLLING A PROGRAM-EXECUTING PRIMARY SYSTEM, issued to Anderson et al, Jan. 6, 1981;
(B) 4,661,953, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Venkatesh et al, Apr. 28, 1987;
(C) 4,679,195, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Dewey, Jul. 7, 1987;
(D) 4,791,642, BUFFER ERROR RETRY, issued to Taylor et al, Dec. 13, 1988;
(E) 4,819,166, MULTI-MODE SCAN APPARATUS, issued to Si et al, Apr. 4, 1989;
(F) 4,852,100, ERROR DETECTION AND CORRECTION SCHEME FOR MAIN STORAGE UNIT, issued to Christensen et al, Jul. 25, 1989; and
(G) 4,855,904, CACHE STORAGE QUEUE, issued to Daberkow et al, Aug. 8, 1989.
2. Description of the Related Art
Modern computer systems are typically constructed of a large number of field replaceable units or FRU's. Examples of FRU's include edge-connected circuit boards, quick-disconnect cables, rack-mounted replacement boxes and other easily-replaceable components. Their use simplifies the repair of in-the-field systems.
When in-the-field system errors occur, and repair by FRU substitution is desired, the main problem is deciding which FRU to replace.
There are two traditional approaches.
Under one approach, the field engineer tries first to pinpoint precisely the circuit from which system errors arise and then the field engineer replaces the associated FRU.
Under the other approach, the field engineer bypasses the task of identifying the errant circuit with any degree of specificity. Instead, the field engineer weeds out the defective FRU through a process of simple trial and error substitution.
Each approach has its share of advantages and disadvantages. The approaches are not mutually exclusive. An engineer can start with one approach and switch to the other in midstream.
For the sake of convenience, the first of the above-mentioned approaches will be referred to as "pinpoint isolation," or "Pin-I" for short. The second of the above-mentioned approaches will be referred to as "Simple Trial and Error FRU Substitution," or "STEFS" for short.
If the Pin-I approach is selected, the step of localizing the source of a problem with pinpoint precision is the main problem. Pinpoint isolation takes time and requires appropriate analysis tools. The upstream costs are relatively high, but successful isolation of faults offsets the downstream costs of not being sure whether the source of each problem has been eliminated.
Trial and error FRU substitution (STEFS), on the other hand, might eliminate a problem in less time and with lower up-front costs. It does not call for deep analysis. But one can never be sure during the STEFS process, or even after it completes, whether a problem has been removed or simply covered up by some mechanism relating to a newly substituted FRU.
STEFS is often selected as the preferred repair method in field environments, particularly where it is believed that isolating the faulty circuit with pinpoint precision will take too long and where the only indication of error is a broad based one (e.g., a checksum error) which is given by an error mechanism that detects but does not localize errors.
Typically, a major part of a computer system or all of the system has to be shut down during the simple trial and error substitution process. System down-time is a function of the total number of FRU's and pure luck.
The field engineer starts going down the line blindly replacing boards and cables (or other FRU's) which are currently installed in the computer with factory pre-tested boards and cables (or other FRU's). Each unsuccessful swap is immediately undone. The engineer continues to swap boards, cables or other FRU's on an arbitrary basis until the error indication goes away. The board, cable, or other FRU last replaced at the time the error indication disappears is tagged as the probable source of error and taken back to the factory for further analysis and/or repair.
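The substitution loop just described can be sketched in a few lines of Python. This is only an idealized illustration, not a procedure taken from any service manual: the names `stefs`, `GOOD_SPARE` and `error_indicated` are hypothetical, each FRU is modeled as a plain string, and the fault is assumed to be persistent so that the broad error indication is always active while the faulty FRU is installed.

```python
GOOD_SPARE = "spare"  # hypothetical stand-in for a factory pre-tested FRU


def stefs(frus, error_indicated):
    """Swap FRU's one at a time until the error indication goes away.

    frus            -- list of installed units (modified in place)
    error_indicated -- hypothetical callable that returns True while the
                       broad error indication (e.g. checksum error) is active
    Returns the slot of the last-replaced FRU, or None if no swap helped.
    """
    for slot, original in enumerate(frus):
        frus[slot] = GOOD_SPARE          # try a pre-tested replacement
        if not error_indicated(frus):
            return slot                  # tagged as the probable source
        frus[slot] = original            # unsuccessful swap is undone
    return None


# Assumed toy system: the error indication is active while any unit is faulty.
system = ["ok", "ok", "faulty", "ok"]
suspect = stefs(system, lambda units: "faulty" in units)
```

In this toy model the loop tags slot 2 after three swaps. The number of swaps tried depends only on where the faulty unit happens to sit in the arbitrary search order, which is the sense in which down-time depends on the number of FRU's and on luck.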
The STEFS method suffers from several drawbacks.
A first problem is that the computer system may have to be shut down for excessively long periods of time as one FRU substitution after another is tried. The likelihood of a long shutdown grows as the number of FRU's increases. It is particularly a problem in large computer systems such as mainframe computers which can have tens or hundreds of FRU's. Each interconnect cable, daughterboard, motherboard, chassis and/or box is considered a FRU.
In the more difficult cases, a system may have hundreds of FRU's and it may be found that the error indication (e.g., checksum error) remains active even after more than 90% of the FRU's have been substituted.
Worse yet, the error indication might disappear after a FRU in the last ten percent has been replaced and then miraculously re-appear as soon as the field engineer leaves the site.
An error indication can temporarily go away simply because a field engineer unknowingly applies stress to a loose wire/contact or otherwise alters conditions in one location while substituting a FRU in a different location. The last replaced board is then incorrectly deemed to be the source of error and returned to the factory for rework. But after the field engineer leaves, the error indication suddenly reappears, primarily because its underlying cause was never corrected in the first place.
A second drawback of the STEFS method is that it is ill adapted for eradicating errors caused by intermittent faults. When the duration between intermittent faults is rather lengthy, and varies randomly, it is not always immediately clear at the time STEFS is being carried out whether the most recently substituted FRU is responsible for the nonoccurrence of error indications in, say, the last hour, or whether the intermittent fault indication will reappear if one waits just a little bit longer. It is not clear how long one should wait before concluding that the problem is fixed.
A further drawback of the STEFS method is the inconvenience of requiring field engineers to bring entire repertoires of FRU's to every service job. If there is no indication that makes one FRU more suspect than others, all FRU's have to be considered equally suspect. Substitutes for all in-field FRU's (e.g., cables, daughterboards, motherboards, etc.) have to be brought along to every in-the-field service job.
In many instances, it is desirable to minimize the amount of time spent completing a service job. System down time can be quite expensive.
A prolonged attempt to remove an error condition by way of simple trial and error FRU swapping (STEFS) is most frustrating when the field engineer is not having any luck in quickly eliminating the error indications (perhaps because their underlying cause is intermittent and somewhat infrequent) and when it is realized that the errors the engineer is trying to eradicate are self-correctable ones.
Some additional explanation is warranted here.
There are two basic types of system errors: self-correctable ones and system non-correctable ones.
Self-correctable errors are defined as those which the computer can automatically correct on its own, either by retrying a failed operation or by reconfiguring its data paths to circumvent a failed component or by using ECC (Error Correction Code) techniques or through other self-correcting mechanisms. Such self-corrections provide a temporary rather than permanent fix for certain kinds of faults.
One common example of a self-correctable error is a bit flip in memory due to an alpha particle emission. Most computers have parity/retry or ECC mechanisms for detecting and correcting such an error.
Another example of a self-correctable error is one caused by an intermittent open or short circuit which becomes good when a failed operation is retried.
System non-correctable errors are defined here as those arising from faults which the computer cannot correct or circumvent on its own. Human intervention is required.
An example is an error arising from a cable connection which becomes completely undone. If there is no alternate path for signals to travel on, then no amount of operation retries or path reconfigurations will correct the error. The system has to be shut down and the faulty part or parts have to be located and repaired or replaced before users can be safely allowed to come back on line.
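The distinction between the two error types can be illustrated with a minimal retry sketch. It is offered only as an illustration of the retry mechanism mentioned above, under stated assumptions: `TransientFault`, `run_with_retry` and `make_flaky` are hypothetical names, and a real machine would implement retry in hardware or microcode rather than in Python.

```python
class TransientFault(Exception):
    """Hypothetical signal that an operation failed but may be retried."""


def run_with_retry(operation, max_retries=3):
    """Return (result, retries_used) if the operation eventually succeeds.

    If every attempt fails, the error is treated as system non-correctable
    and a RuntimeError is raised so that a human can intervene.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return operation(), attempt
        except TransientFault as exc:
            last_error = exc             # self-correction: simply try again
    raise RuntimeError("non-correctable: retries exhausted") from last_error


def make_flaky(failures):
    """Build an operation that fails a given number of times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("intermittent open circuit")
        return "data"

    return operation


# An intermittent fault that clears on the third attempt is self-corrected.
result, retries_used = run_with_retry(make_flaky(2))
```

An operation that keeps failing past `max_retries`, like the completely disconnected cable in the example above, surfaces as an exception instead of being masked.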
Although it is possible, and sometimes preferable, to ignore and repeatedly bypass faults which give rise to self-correctable errors (for example, non-fixable faults such as those which produce an irreducible number of alpha particle emissions), some faults (e.g., intermittent open/short circuits) should be viewed as precursors to, or warnings of, an upcoming non-correctable error.
A more permanent form of correction other than system self-correction becomes advisable when the total number of self-correctable errors in a predefined time period exceeds a predefined threshold. At that point, a judgment is made that it is better to try to find the causes for the self-correctable errors and to permanently fix those that are fixable by, for example, replacing the associated FRU at a conveniently scheduled time (periodic maintenance), rather than waiting and allowing flaws to accumulate or worsen to the point where they produce a system non-correctable, fatal problem at an inconvenient time.
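The threshold test described above can be sketched as a sliding-window counter. The class name `CorrectableErrorMonitor`, its parameters, and the use of timestamps in seconds are assumptions made for illustration; the text does not specify how the count is kept or what window and threshold values are used.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Counts self-corrected errors inside a predefined time period."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds  # the predefined time period
        self.threshold = threshold            # the predefined threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Log one self-corrected error at `timestamp` (seconds).

        Timestamps are assumed to arrive in non-decreasing order.
        Returns True when the count within the window exceeds the
        threshold, i.e. when a scheduled repair becomes advisable.
        """
        self._timestamps.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# Assumed values: more than 3 corrected errors within 60 seconds.
monitor = CorrectableErrorMonitor(window_seconds=60, threshold=3)
flags = [monitor.record(t) for t in (0, 10, 20, 30, 200)]
```

With these assumed values the fourth error inside the window raises the flag, while the isolated error at 200 seconds does not, because the earlier errors have aged out by then.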
Despite the long-term wisdom of such judgment, when a scheduled repair job is undertaken, and the selected approach is simple trial and error FRU substitution (STEFS), it can become extremely frustrating to have the system shut down for excessively long periods of time. The explanation might be that bad luck was encountered in choosing which FRU to replace first or that the cause of errors is difficult to eliminate by way of STEFS because the fault is both infrequent and intermittent. But this does not lessen the problem.
The frustration grows worse yet if the goal of eliminating an excessive number of self-correctable errors is not attained after the computer has been shut down for an excessively long time span and, because of other considerations, it is now necessary to turn the system back on. By way of example, the repair attempt may have to be continued at some other time because a scheduled maintenance period is now over.
The frustration is heightened by the awareness that the computer could have been up and running during the entire nonproductive time span but for the decision having been made to use STEFS as the way to eradicate the source of the excess number of self-correctable errors.
The slowness of the STEFS method, particularly in the case where infrequent intermittent errors are involved, is directly attributable to a total lack of knowledge about where the faulty circuit might be located.
A brute force solution to this problem would seek to position error detectors (e.g. parity checkers) at all conceivable points in the computer system where faults may develop. Ideally, every FRU would have its own set of error detectors and these detectors would be coupled to all circuits contained in that FRU. Each source of errors within a faulty FRU would activate the corresponding set of error detectors and thereby pinpoint the faulty FRU with specificity.
Such a brute force approach is rarely practical, however. Provisions for error detectors have to be judiciously incorporated into computer systems in limited numbers and at points which provide an optimum cost-versus-performance return on investment.
It is not realistic to consider locating error detectors everywhere in very large, high-density computer systems (e.g. IBM or Amdahl mainframes). There are too many circuits and too many potential sources of error. Costs would become prohibitive.
In many instances, functional circuits have to be packed so tightly in a computer (for speed and other practical reasons) that there is no room for including error detecting circuitry in the first place.
Heretofore, in situations where error detecting circuitry could not be included in every conceivable part of every FRU, the field engineer was left to the mercy of blind luck when he/she had to locate an errant FRU on the basis of simple trial and error substitution. A better approach is needed.