The present invention relates to monitoring, detecting, and isolating failures in a system, and in particular to tools applied for analyzing the system.
"To diagnose" means to determine why a malfunctioning device is behaving incorrectly. More formally, to diagnose is to select a subset of a predetermined set of causes responsible for the incorrect behavior. A diagnosis must both explain the incorrect behavior and optimize some objective function, such as probability of correctness or cost of incorrect diagnosis. The need to diagnose is a common reason to measure or to test.
The following considers the diagnosis of an engineered device for the purpose of repair or process improvement. This is in contrast to, say, a distributed computer system containing software objects that may be created or destroyed at any time. It is assumed that the device consists of a finite number of replaceable components. Failures of the device are caused only by having one or more bad components. What is called "diagnosis" herein is often called "fault identification". When presented with a failed device, a technician or a computer program (sometimes called a "test executive") will run one or more tests. A technician familiar with the internal workings of a failing device must interpret the test results to identify the bad components.
Expert systems have been used for diagnosing computer failures, as described e.g. by J. A. Kavicky and G. D. Kraft in "An expert system for diagnosing and maintaining the AT&T 3B4000 computer: an architectural description", ACM, 1989. Analysis of data from on-bus diagnosis hardware is described in Fitzgerald, G. L., "Enhance computer fault isolation with a history memory," IEEE, 1980. Fault-tolerant computers have for many years been built with redundant processing and memory elements, data pathways, and built-in monitoring capabilities for determining when to switch off a failing unit and switch to a good, redundant unit (cf. e.g. U.S. Pat. No. 5,099,485).
Prior diagnostic systems for determining likely failed components in a system under test (SUT) include model-based diagnostic systems. A model-based diagnostic system may be defined as a diagnostic system that renders conclusions about the state of the SUT using actual SUT responses from applied tests and an appropriate model of correct or incorrect SUT behavior as inputs to the diagnostic system. Such a diagnostic system is usually based upon computer-generated models of the SUT and its components and the diagnostic process.
Model-based diagnostic systems are known e.g. from W. Hamscher, L. Console, J. de Kleer, in 'Readings in system model-based diagnosis', Morgan Kaufmann, 1992. A test-based system model is used by the Hewlett-Packard HP Fault Detective (HPFD) and described in HP Fault Detective User's Guide, Hewlett-Packard Co., 1996.
U.S. Pat. No. 5,808,919 (Preist et al.) discloses a model-based diagnostic system, based on functional tests, in which the modeling burden is greatly reduced. The model disclosed in Preist et al. employs a list of functional tests, a list of components exercised by each functional test along with the degree to which each component is exercised by each functional test, and the historical or estimated a priori failure rate for individual components.
U.S. Pat. No. 5,922,079 (Booth et al.) discloses an automated analysis and troubleshooting system that identifies potential problems with the test suite (ability of the model to detect and discriminate among potential faults), and also identifies probable modeling errors based on incorrect diagnoses.
EP-A-887733 (Kanevsky et al.) discloses a model-based diagnostic system that provides automated tools that enable a selection of one or more next tests to apply to a device under test from among the tests not yet applied based upon a manageable model of the device under test.
In the above three model-based diagnostic systems, a diagnostic engine combines the system-model-based and probabilistic approaches to diagnostics. It takes the results of a suite of tests and computes, based on the system model of the SUT, the components most likely to have failed.
The diagnostic engine can be used with applications where a failing device is to be debugged using a pre-determined set of test and measurement equipment to perform tests from a pre-designed set of tests. Using test results received from actual tests executed on the SUT and the system model determined for the SUT, the diagnostic engine computes a list of fault candidates for the components of the SUT. Starting, e.g., from a priori (that is, formed or conceived beforehand) failure probabilities of the components, these probabilities may then be weighted with the model information accordingly if a test passes or fails. At least one test has to fail, otherwise the SUT is assumed to be good.
An embedded processor is a microprocessor or other digital computing circuit that is severely limited in computing power and/or memory size because it is embedded in (i.e., built into) another product. Examples of products typically containing embedded processors include automobiles, trucks, major home appliances, and server class computers (that often contain an embedded maintenance processor in addition to the Central Processing Unit(s)). Embedded processors typically have available several orders of magnitude less memory and an order of magnitude or two less computing power than a desktop personal computer. For example, a megabyte of memory would be a large amount for a home appliance. It is desirable to enable such an embedded processor in such a product to diagnose failures of the product. A diagnosis engine providing such a capability shall be called an embedded diagnosis engine.
It is possible to perform probabilistic diagnosis by various heuristic methods, as applied by the aforementioned HP Fault Detective product or U.S. Pat. No. 5,808,919 (Preist et al.). Heuristics by nature trade off some accuracy for reduced computation time. However, the HP Fault Detective typically requires 4 to 8 megabytes of memory. This can be a prohibitive amount for an embedded diagnosis engine.
Another method for solving the problem is Monte Carlo simulation. Although the Monte Carlo simulation method can be made arbitrarily accurate (by increasing the number of simulations), the simulation results must be stored in a database that the diagnosis engine later reads. It has been shown that, even when stored in a space-efficient binary format, this database requires 2-6 megabytes for typical applications. This is too much for an embedded application and would be a burden on a distributed application where the database might have to be uploaded on a computer network for each diagnosis.
A common way of building a probabilistic diagnostic system is to use a Bayesian network (cf. Finn V. Jensen: "Bayesian Networks", Springer Verlag, 1997). A Bayesian network is a directed acyclic graph. Each node in the graph represents a random variable. An edge in the graph represents a probabilistic dependence between two random variables. A source (a node with no in-edges) is independent of all the other random variables and is tagged with its a priori probability. A non-source node is tagged with tables that give probabilities for the value of the node's random variable conditioned on all of the random variables upon which it is dependent.
The computation on Bayesian networks of most use in diagnosis is called belief revision. Suppose values of some of the random variables (in the present context, the results of some tests) are observed. A belief revision algorithm computes the most likely probabilities for all the unobserved random variables given the observed ones. Belief revision is NP-hard (cf. M. R. Garey and D. S. Johnson: "Computers and Intractability: A Guide to the Theory of NP-Completeness", W.H. Freeman and Co., 1979), and so all known algorithms have a worst-case computation time exponential in the number of random variables in the graph.
Bayesian networks used for diagnosis are constructed with random variables and their dependencies representing arbitrary cause-and-effect relationships among observables such as test results, unobservable state of the device under diagnosis and its components, and failure hypotheses. The graph can grow very large and have arbitrary topology. For example, an experimental Bayesian network used by Hewlett-Packard for printer diagnosis has over 2,000 nodes. The complexity of such networks creates two difficulties:
all of the conditional probabilities for non-source nodes must be obtained or estimated, and
local changes to topology or conditional probabilities may have difficult-to-understand global effects on diagnostic accuracy.
In other words, the use of a large Bayesian net of arbitrary topology for diagnosis has somewhat the same potential for supportability problems as rule-based diagnostic systems do.
It is an object of the invention to provide an improved probabilistic diagnosis that can also be applicable for embedded and/or remote applications.
One aspect of the present invention is to provide a diagnosis engine, that is, a tool that provides automatic assistance, e.g. to a technician, at each stage of a debugging process by identifying components that are most likely to have failed.
A major advantage of the present diagnosis engine over other diagnosis engines is that it can be provided with a small memory footprint: both code and runtime memory requirements are small, growing only linearly with the model size.
The diagnosis engine can be embodied in a program storage device, readable by a machine, containing a program of instructions readable by the machine. The program is preferably written entirely in Java (cf. e.g. James Gosling, Bill Joy, and Guy Steele: The Java Language Specification, Addison Wesley, 1996) and preferably uses only a few classes from the Java standard language library packages. These features make the present diagnosis engine particularly well suited to embedded and distributed applications.
The present diagnosis engine can be used on applications where a failing device is to be debugged using a predetermined set of test and measurement equipment to perform tests from a pre-designed set of tests. For present purposes, a test is a procedure performed on a device. A test has a finite number of possible outcomes. Many tests have two outcomes: pass and fail. For example, a test for repairing a computer may involve checking to see if a power supply voltage is between 4.9 and 5.1 volts. If the power supply voltage is between 4.9 and 5.1 volts, then the test passes. Otherwise, the test fails. Tests may have additional outcomes, called failure modes. For example, a test may involve trying to start an automobile. If the automobile starts, then the test passes. Failure modes might include:
the lights go dim when the key is turned, and there is no noise from under the hood,
the lights stay bright when the key is turned, and there is the noise of a single click,
the lights stay bright when the key is turned, there is a click, and the starter motor turns, but the engine doesn't turn over, and so forth.
The set of all tests available for debugging a particular device is called that device's test suite. Many applications fit these definitions of debugging and of tests. Examples are:
computer and electronics service and manufacturing rework,
servicing products such as automobiles and home appliances, and
telephone support, if the idea of "test" is broadened to include obtaining answers to verbal questions.
Given:
a set of tests on a physical object (e.g., Test 1=pass, Test2=fail, Test 3=pass, etc.) where at least one test has failed, and
a model giving the coverage of the tests on the components (e.g., field replaceable units) of the object and information describing probabilistic dependencies between tests,
the diagnostic engine in accordance with the present invention outputs a probabilistic diagnosis of the object, that is, a list, each element of which contains:
a list of one or more components, and
the likelihood or probability that those components are the bad components. (Likelihood is un-normalized probability. That is, probabilities must sum to one but likelihoods need not.)
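The distinction between likelihood and probability can be illustrated with a minimal sketch. The component names and likelihood values below are hypothetical and serve only to show the normalization step; they are not taken from any real model.

```python
def normalize(likelihoods):
    """Scale un-normalized likelihoods so that they sum to one."""
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return {diagnosis: value / total for diagnosis, value in likelihoods.items()}


# Hypothetical likelihoods for three candidate diagnoses (illustrative only).
likelihoods = {("cpu",): 0.03, ("memory",): 0.01, ("cpu", "memory"): 0.01}
probabilities = normalize(likelihoods)
```

Because normalization divides every entry by the same total, the ranking of the candidate diagnoses is the same whether likelihoods or probabilities are reported.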
Most automated diagnosis systems provide simply a list of possible diagnoses without weighting by probability. Having probabilities is particularly desirable in applications where the number of field replaceable units (FRU) is small. The probabilities also give technicians an opportunity to apply their own expertise.
A diagnosis engine in accordance with the present invention allows handling multiple component failures. No distinction is made between single and multiple faults.
A diagnosis engine in accordance with the present invention can combine the model-based (cf. W. Hamscher, L. Console, and J. de Kleer: Readings in model-based diagnosis, Morgan Kaufmann, 1992) and probabilistic approaches to diagnostics.
A diagnosis engine in accordance with the present invention can use the same test-based model as used by the aforementioned HP Fault Detective or in U.S. Pat. No. 5,808,919 (Preist et al.). This model describes probabilistic relationships between tests and the components that they test in a manner intended to be accessible to engineers who write tests. Features of this model preferably include:
a two-level part-whole hierarchy: names of components (field-replaceable units) and of their sub-components,
estimates of a priori failure probabilities of the components,
the names of the tests in the test suite,
an estimate of the coverage that each test has on each component, i.e., the proportion of the functionality of the component that is exercised by the test, or more formally, the conditional probability that the test will fail given that the component is bad,
shared coverages of tests, that are a way of modeling tests that are dependent because they test the functionality of some components in exactly the same way (for example, two tests that access a certain component through a common cable have shared coverage on the cable), and
a way of specifying failure modes for tests in addition to pass and fail. Failure modes have a name, and two lists of components or sub-components. The first list, called the acquit list, names the components or sub-components that must have some operable functionality in order for the failure mode to occur. The second list, called the indict list, names the components or sub-components that may be bad if the failure mode occurs. Each entry in the acquit and indict lists also contains an estimate of the amount of functionality of the component that the failure mode exercises.
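As a rough illustration, part of such a model could be held in a few small data structures. The sketch below is an assumption made for clarity and is not the format used by HP Fault Detective or by the invention: the class names `Model` and `SharedFunction`, the field names, and the `problems` check are all hypothetical. It covers priors, coverages, and shared coverages only; the two-level hierarchy and the failure modes with their acquit and indict lists are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedFunction:
    """Coverage that several tests exercise in exactly the same way."""
    name: str
    coverage: Dict[str, float]   # component -> coverage in [0, 1]
    tests: List[str]             # tests that depend on this shared function


@dataclass
class Model:
    priors: Dict[str, float]                 # component -> a priori failure probability
    coverage: Dict[str, Dict[str, float]]    # test -> {component: coverage}
    shared: List[SharedFunction] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of consistency problems; an empty list means none were found."""
        found = []
        for component, p in self.priors.items():
            if not 0.0 <= p <= 1.0:
                found.append(f"prior of {component} is outside [0, 1]")
        tables = list(self.coverage.items()) + [(s.name, s.coverage) for s in self.shared]
        for owner, table in tables:
            for component, c in table.items():
                if component not in self.priors:
                    found.append(f"{owner} covers unknown component {component}")
                if not 0.0 <= c <= 1.0:
                    found.append(f"coverage of {owner} on {component} is outside [0, 1]")
        for s in self.shared:
            for test in s.tests:
                if test not in self.coverage:
                    found.append(f"{s.name} names unknown test {test}")
        return found


# Hypothetical two-component model: both tests reach the board through one cable.
model = Model(
    priors={"board": 0.02, "cable": 0.005},
    coverage={"boot": {"board": 0.9}, "loopback": {"board": 0.4}},
    shared=[SharedFunction("cable_path", {"cable": 0.7}, ["boot", "loopback"])],
)
```

Keeping the shared coverage in its own record, separate from the per-test coverage, mirrors the cable example above: the dependence between the two tests is stated once instead of being duplicated in each test's coverage table.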
Models can be created by:
Using a model-building graphical user interface (GUI) that comes e.g. with the aforementioned HP Fault Detective. The HP Fault Detective model is read by a program that translates it into a simpler form used internally by the invention that can be saved as an ASCII file. The invention can load such a file from a file system, from a URL, or from local memory.
Writing ASCII text Fault Detective Model (.fdm) files, or
Using a model creation application programming interface (API) in Java.
The model, together with the rules of mathematical logic, enables one to compute the probability that a test will fail if a particular component is known to be bad. More details about these models and the model-building process are disclosed in the co-pending US patent application (Applicant's internal reference number: US 20-99-0042) by the same applicant and in U.S. Pat. No. 5,922,079 (Booth et al.). The teaching of the former document with respect to the description of the model and the model-building process is incorporated herein by reference.
A diagnosis engine in accordance with the present invention allows computing the probability of a test's failure when given any pattern of components known to be good or bad. The logic formula known as Bayes' Theorem allows running this computation in reverse. That is, given a particular test result, a diagnosis engine in accordance with the present invention can calculate the probability of occurrence of some particular pattern of component faults and non-faults. A diagnosis engine in accordance with the present invention can then enumerate all the possible patterns of component faults/non-faults, evaluating the probability of each pattern given the test result. The pattern with highest probability is selected as the diagnosis.
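The reverse computation can be written out as follows. The notation is assumed for this illustration only: $C$ denotes one pattern of component faults and non-faults, $T$ the observed test result, and the sum in the denominator runs over all enumerated patterns $C'$.

```latex
P(C \mid T) = \frac{P(T \mid C)\, P(C)}{\sum_{C'} P(T \mid C')\, P(C')}
```

The denominator is the same for every pattern, so ranking the patterns only requires the numerator $P(T \mid C)\,P(C)$, which is an un-normalized likelihood.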
Of course, one test is seldom sufficient to make an unambiguous failure diagnosis. If the test succeeds, it may clear some components, but not indicate the culprit. If the test fails, several components may be indicted, and other tests are required to clear some components or focus suspicion on other components. (Here, "clearing" means to lower the computed fault probability substantially, and "focusing suspicion" means to raise the probability to the top or near the top.) Handling multiple test results is easy and quick if the tests are independent of each other. But if the tests are not independent, the problem is much more complex. The dependence is modeled by the shared functions. A case-by-case breakdown must be made of all the ways the shared functions might pass or fail and how they affect the joint probabilities of the test results. Then all these influences must be summed, as sketched e.g. in the outline of a diagnosis algorithm (in pseudo-code) as shown below:
1. For each possible combination of bad components:
(a) Set sum to 0.
(b) For each possible pass/fail combination of shared functions:
i. Compute the probability of the observed test results.
ii. Add the probability to sum.
(c) Calculate likelihood of the combination of bad components given sum (using Bayes' Theorem).
2. Sort the fault likelihoods in descending order.
The algorithm iterates over combinations of failed components and computes the conditional likelihood of each combination given passed and failed tests.
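The outline above can be made concrete with a minimal Python sketch. Several details are assumptions that the outline does not state: coverages of several bad components are combined in a noisy-or fashion, a test fails whenever a shared function it uses has failed, and the names `fail_prob`, `diagnose`, and `max_faults`, as well as the dictionary layout and all numbers, are hypothetical. This is the brute-force form of the outline, not the reduced-computation method of the invention.

```python
from itertools import combinations, product


def fail_prob(cov, bad):
    """Assumed noisy-or: probability that at least one bad component trips this coverage."""
    ok = 1.0
    for c in bad:
        ok *= 1.0 - cov.get(c, 0.0)
    return 1.0 - ok


def diagnose(priors, coverage, shared, observed, max_faults=1):
    """Return (bad components, likelihood) pairs sorted by descending likelihood."""
    names = list(shared)
    results = []
    for k in range(1, max_faults + 1):
        for bad in combinations(sorted(priors), k):              # step 1
            prior = 1.0
            for c, p in priors.items():
                prior *= p if c in bad else 1.0 - p
            total = 0.0                                          # step 1(a)
            # step 1(b): every pass/fail combination of shared functions (True = failed)
            for states in product((False, True), repeat=len(names)):
                p = 1.0
                failed = set()
                for name, is_failed in zip(names, states):
                    pf = fail_prob(shared[name]["coverage"], bad)
                    p *= pf if is_failed else 1.0 - pf
                    if is_failed:
                        failed.add(name)
                for test, passed in observed.items():            # step 1(b)i
                    uses_failed = any(test in shared[n]["tests"] for n in failed)
                    p_fail = 1.0 if uses_failed else fail_prob(coverage.get(test, {}), bad)
                    p *= (1.0 - p_fail) if passed else p_fail
                total += p                                       # step 1(b)ii
            results.append((bad, prior * total))                 # step 1(c)
    results.sort(key=lambda item: item[1], reverse=True)         # step 2
    return results


# Hypothetical model: two tests that both reach their targets through one cable.
priors = {"cpu": 0.01, "ram": 0.02, "cable": 0.005}
coverage = {"t1": {"cpu": 0.9}, "t2": {"ram": 0.8}}
shared = {"cable_path": {"coverage": {"cable": 0.7}, "tests": ["t1", "t2"]}}
observed = {"t1": False, "t2": True}   # True = passed, False = failed
results = diagnose(priors, coverage, shared, observed, max_faults=1)
```

In this example the cable alone cannot explain the observations: if the shared cable path had failed, both tests would fail, and if it had not, nothing covered by `t1` would be bad. The sketch therefore assigns the cable a likelihood of zero and ranks the CPU first.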
Clearly, this method can require enormous amounts of computation as it explores all combinations of shared function outcomes for all combinations of faults.
The mathematical details of how this is accomplished, and of how the computational burden is reduced to make the method practical, are discussed below in the section 'Detailed Description of the Invention'.
A Bayesian network can represent any model used by a diagnosis engine in accordance with the present invention. The resulting graph is tripartite, consisting solely of sources, sinks, and one level of internal nodes (as shown later). There is one source for each component. There is one sink for each test. Each shared function is represented by one internal node. However, in order to represent test coverage information, the so-called "Noisy-or" (defined and described in detail in chapter 3 of Finn V. Jensen: "Bayesian Networks", Springer Verlag, 1997) construction must be used. The form of test coverage information is such that the memory-saving technique of divorcing (again, see Chapter 3 of Jensen) cannot be used. This means that the Bayesian network will require an amount of memory exponential in the number of components covered by any test. Even small models exhaust the memory of desktop PC workstations. Clearly this approach is not well suited to embedded or distributed application.
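The exponential growth can be made tangible with a short calculation. The sketch assumes that a test node's conditional probability table is stored explicitly with one row per combination of good/bad states of the components it covers; the function names are hypothetical.

```python
def full_table_rows(parents: int) -> int:
    """Rows in an explicit conditional probability table over binary parents."""
    return 2 ** parents


def coverage_parameters(parents: int) -> int:
    """A coverage-based model needs one coverage value per covered component."""
    return parents


# Compare explicit table size with the number of coverage values.
for n in (10, 20, 30):
    print(n, full_table_rows(n), coverage_parameters(n))
```

Under this assumption, a single test covering 30 components would need over a billion table rows, while the coverage description of the same test consists of 30 numbers.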
The class of models as applied by the diagnosis engine in accordance with the present invention can be viewed as a subclass of the Bayesian networks. The diagnosis engine in accordance with the present invention can utilize a diagnosis algorithm that can be considered to be an efficient algorithm for belief revision over this subclass. The high accuracy rate of successful diagnosis with the diagnosis engine in accordance with the present invention (as shown later) suggests that this subclass is sufficiently powerful to represent practical diagnosis problems. Furthermore, the relatively structured nature of a model used by the diagnosis engine in accordance with the present invention may be an advantage when building and supporting a model, compared with free-form construction of a Bayesian network.
Like Bayesian Networks, the diagnosis engine in accordance with the present invention computes the component failure likelihoods exactly. Heuristic methods and Monte Carlo simulation compute approximate likelihoods. The runtime performance of the diagnosis engine in accordance with the present invention is good in practice. It runs about as fast as the aforementioned HP Fault Detective on the same diagnosis problems.
In a nutshell, the diagnosis engine in accordance with the present invention is based on the assumptions that:
1. Component states (that is, whether each component is good or bad) are probabilistically independent;
2. Shared function states (that is, whether each shared function is passed or failed) are probabilistically independent given component states; and
3. Test states (that is, whether each test is passed or failed) are probabilistically independent given component and shared function states.
Assumption 1 is used to compute the probability of any particular set of components being bad and all others being good.
Assumption 2 is used to compute the probability of any particular set of shared functions being failed and another particular set of shared functions being passed given that a particular set of components are bad.
Assumption 3 is used to compute the probability of any particular set of tests being failed and another particular set of tests being passed given that a particular set of components are bad (and the rest are good) and a particular set of shared functions are failed (and the rest are passed).
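Written as formulas, the three assumptions give the following factorization. The notation is assumed for this illustration: $C$ is the set of bad components, $S$ the set of failed shared functions, $T$ the observed test outcomes, and $p_c$ the a priori failure probability of component $c$.

```latex
\begin{align*}
P(C) &= \prod_{c \in C} p_c \prod_{c \notin C} (1 - p_c) \\
P(S \mid C) &= \prod_{s \in S} P(s \text{ failed} \mid C) \prod_{s \notin S} P(s \text{ passed} \mid C) \\
P(T \mid C, S) &= \prod_{t} P(t = \text{observed outcome} \mid C, S) \\
P(C \mid T) &\propto P(C) \sum_{S} P(S \mid C)\, P(T \mid C, S)
\end{align*}
```

The last line combines the three factors through Bayes' Theorem: the sum over $S$ covers every pass/fail combination of the shared functions, and the result is the un-normalized likelihood of the component set $C$.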
Thus, the diagnosis engine in accordance with the present invention:
1. Specifies the component a priori probabilities, coverages, and shared function coverages.
2. Specifies which tests have passed and which tests have failed. (Some tests may be neither passed nor failed because they were never performed.)
3. Specifies how many components may be simultaneously bad. Call this number N, N being a positive integer.
4. Computes the likelihood that each of the subsets of the components with size less than or equal to N comprises the bad components, whereby
the computation is exact (to within small floating point computation error); and
the amount of memory required to perform the computation is preferably only a constant amount larger than the memory required to store the inputs and the outputs. ("A constant amount larger" shall mean an increased amount that is the same independent of the model size and N.)
5. Outputs the likelihoods, either
in human-readable form, or
as computer data available for further automatic processing.
Instead of feature #3, a default value (e.g., 1 or 2) could be built in. This reduces the flexibility in using the diagnosis engine in accordance with the present invention without impairing its usefulness much.
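The effect of the limit N on the amount of work can be estimated by counting the candidate subsets. The following sketch uses hypothetical function names and starts at subsets of size one, on the assumption that the empty set need not be scored because at least one test has failed. The generator yields one subset at a time, so the enumeration itself does not need to hold all subsets in memory.

```python
from itertools import combinations
from math import comb


def candidate_count(n_components, max_faults):
    """Number of non-empty component subsets with at most max_faults members."""
    return sum(comb(n_components, k) for k in range(1, max_faults + 1))


def candidates(components, max_faults):
    """Yield candidate subsets lazily, smallest subsets first."""
    for k in range(1, max_faults + 1):
        yield from combinations(components, k)


# With 100 components, N = 1 gives 100 candidates and N = 2 gives 5050.
print(candidate_count(100, 1), candidate_count(100, 2))
```

The count grows quickly with N, which is consistent with choosing a small built-in default such as 1 or 2.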
The diagnosis engine in accordance with the present invention thus requires an amount of memory less than the amount of memory required to store the model and the amount of memory required to store the output multiplied by a small factor that is a constant independent of the model and output sizes. This makes such a diagnosis engine well suited to use as an embedded diagnosis engine.
It is clear that the diagnosis engine in accordance with the present invention can be partly or entirely embodied by one or more suitable software programs, that can be stored on or otherwise provided by any kind of data carrier, and that can be executed in or by any suitable data processing unit. Combinations of hardware and software can also be used.