1. Field of the Invention
This invention relates generally to the automated isolation of faulty Field Replaceable Units (FRUs) in an electromechanical system responsive to a stream of system performance error data and specifically to a microcode-embodied process for isolating failed or failing FRUs using automated isolation testing procedures and a weighted combination of FRU probability-of-failure data.
2. Discussion of the Related Art
Modern fault detection practice for complex electronic equipment involves automated performance monitoring and the creation and display of fault codes responsive to equipment failures. Usually, such fault codes are reviewed by a maintenance technician and interpreted according to a published maintenance manual. Based on trouble-shooting maintenance tables, FRU isolation procedures are manually selected and performed and any resulting new fault codes are then reinterpreted according to the maintenance manual. This process continues iteratively until the maintenance technician succeeds in replacing all faulty FRUs.
The faulty FRU isolation process presents two fundamental problems. The first problem is the need for continual updates to published maintenance manuals to reflect system design changes. As system components and FRUs are redefined or improved, failure of a particular FRU may introduce new symptoms, which must be interpreted using existing or modified fault codes. Also, as field maintenance experience grows, particular fault codes may be reinterpreted as isolating different FRUs, thereby requiring changes to product field documentation. Serious maintenance delays may arise when product documentation updates are not tightly coupled with system design changes.
The second fundamental problem relates to the FRU isolation process itself. The usual practice is to report system faults as a series of predefined symptoms or Fault Symptom Codes (FSCs). Any particular FSC may arise because of the failure of any one or more of a large plurality of FRUs. The laborious process of isolating the faulty FRU from such a large plurality is usually accomplished by selecting and performing several predetermined isolation test procedures, each of which gives rise to additional FSCs that may then be analyzed in turn to narrow the field of candidate FRUs. While the test procedures are themselves automated, the usual practice is to isolate faulty FRUs through some combination of manual FSC analyses and manual selection and initiation of isolation tests.
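The narrowing process described above can be sketched in a few lines, purely for illustration. The FSC-to-FRU table, the FSC and FRU names, and the function narrow_candidates are hypothetical and do not come from any of the cited references; the sketch assumes that each isolation test returns at most one new FSC and that each FSC maps to a known set of candidate FRUs.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical lookup table (an assumption for illustration only):
# each Fault Symptom Code maps to the FRUs whose failure could produce it.
FSC_TO_FRUS: Dict[str, Set[str]] = {
    "FSC-A": {"FRU1", "FRU2", "FRU3"},
    "FSC-B": {"FRU2", "FRU3"},
    "FSC-C": {"FRU3", "FRU4"},
}


def narrow_candidates(
    initial_fsc: str,
    isolation_tests: List[Callable[[], Optional[str]]],
) -> Set[str]:
    """Narrow the candidate FRUs by intersecting the sets implicated by each new FSC."""
    candidates = set(FSC_TO_FRUS[initial_fsc])
    for run_test in isolation_tests:
        if len(candidates) <= 1:
            break  # nothing left to narrow
        new_fsc = run_test()  # each test may report a new FSC, or None
        implicated = FSC_TO_FRUS.get(new_fsc)
        if implicated is None:
            continue  # no FSC reported, or an FSC missing from the table
        narrowed = candidates & implicated
        if narrowed:
            candidates = narrowed  # keep the prior set if the intersection is empty
    return candidates


if __name__ == "__main__":
    tests = [lambda: "FSC-B", lambda: "FSC-C"]
    print(narrow_candidates("FSC-A", tests))
```

In the manual practice described above, a technician performs the equivalent of each loop iteration by hand: choosing the next test, reading the resulting FSC, and consulting the maintenance tables.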
Practitioners in the art have proposed several automated improvements to the fault isolation procedures known in the art. For instance, in U.S. Pat. No. 3,928,830, L. R. Bellamy et al. disclose a simple diagnostic system that automatically monitors FRUs for failures or out-of-tolerance conditions, latching up a display to show which FRUs are out of tolerance. In U.S. Pat. No. 4,687,988, Edward B. Eichelberger et al. disclose a weighted random-pattern testing technique that isolates failures in VLSI circuit devices through correlation of an output signature, representing a collection of the output responses from each output terminal in parallel, with predetermined pseudo-random patterns at the circuit device inputs.
These two test procedure concepts are combined in U.S. Pat. No. 4,766,595, wherein Bernard P. Gollomp adds a simple behavior model incorporating predetermined design knowledge of the unit undergoing test. Gollomp's specification-based diagnostic approach reduces the number of isolation test procedures automatically performed to isolate a faulty FRU. Essentially, Gollomp combines a symptom-based diagnostic approach with a specification-based diagnostic approach to balance the number of automatic diagnostic tests against the usual requirement for human technical judgment during manually-initiated testing.
Other "expert system" approaches to the isolation problem include the methods disclosed by Mark E. Clark et al. in U.S. Pat. No. 4,881,230 and by Luan J. Denny in U.S. Pat. No. 4,817,092. These techniques employ decision trees and an expert system to further refine the identification of a faulty FRU, to eliminate all functional components of the faulty FRU and to automatically execute diagnostic test procedures to assure that no extraneous errors from other system elements are causing the fault symptom.
Further, S. M. Douskey ("Recording Failure Information on Returned Hardware", IBM Technical Disclosure Bulletin, Vol. 32, No. 7, pp. 227-228, December 1989) describes a method for using a nonvolatile RAM in each FRU to permanently store failure history data for use in board-level factory analysis and repair. J. Combes et al. ("Automatic Analysis of Error Records in a Communication Controller for Optimum Maintenance", IBM Technical Disclosure Bulletin, Vol. 31, No. 6, pp. 138-141, November 1988) disclose an improvement for the automatic analysis of Fault Symptom Code (FSC) error records by a service processor, which generates a reference code that directly denominates the faulty FRU. Combes et al. suggest automated translation of an FSC into an FRU identification but neither teach nor suggest the automated isolation of FRUs, instead teaching the manual selection and performance of diagnostic testing in the usual manner.
More recently, in U.S. Pat. No. 5,293,556, Fletcher L. Hill et al. disclose a knowledge-based FRU management scheme that provides a guided testing procedure by which service personnel are prompted to manually select candidates for FRU swaps or replacements. FRU change and swap detection is prompted as a function of FRU memory scanning together with embedded database query activity and, for swap activity, virtual Composite Failure Events are created and maintained to manage post-swap failure symptoms. Post-swap and post-change validation logic verifies the correct positioning and functioning of the target FRUs. Hill et al. neither consider nor suggest automated isolation testing, instead describing a sophisticated expert system for use in prompted manual maintenance.
Similarly, in U.S. Pat. No. 5,253,184, Donald Kleinschnitz discloses a failure and performance tracking system that assists in failure isolation by using an expert system with rules, hypotheses and collective historical data. Kleinschnitz provides memory in each FRU to accumulate and store performance history data. Data is also accumulated to show failed operational components within each FRU as well as any prior maintenance activity in which the FRU came under suspicion. Kleinschnitz teaches the use of a series of retry attempts, each failure of which produces a new set of fault symptom data that successively narrows the field of suspect FRUs. While Kleinschnitz suggests the use of statistical failure data stored within each FRU, he neither considers nor suggests the use of this statistical failure data for FRU isolation but instead merely suggests using such data to predict the next likely time of failure for the particular FRU. Essentially, Kleinschnitz teaches a neural-network-based maintenance system that adapts to changes in failure symptoms over time, gradually learning more useful interpretations for each failure symptom code. He neither considers nor suggests the automated isolation testing of a faulty system responsive to fault symptom codes and provides no specific isolation procedures other than a general neural-based expert system that performs an isolation analysis responsive to a fault persistence counter reading that exceeds a predetermined persistence threshold. Human interaction is required because the system merely guides the maintenance engineer in his evaluation of a failed FRU and the maintenance engineer responds to the system with his human observations.
Advantageously, the expert-system approach to failure isolation does much to resolve the first problem mentioned above, the coupling of field publications to system design changes, because the expert system can readily adjust to accommodate system design changes affecting FSC interpretation. Moreover, improved isolation testing automation minimizes the requirement for manual maintenance engineering involvement, thereby minimizing system downtime during repair. Nevertheless, there is still a clearly-felt need in the art for more fully automatic isolation of faulty FRUs and of the faulty Replaceable Components (RCs) within each FRU. The related unresolved problems and deficiencies are solved by this invention in the manner described below.