Automated diagnosis of faults in a system can take many forms. One form of diagnosis is history-based. History-based diagnostic systems form a diagnosis based on a substantial body of historical failure data and symptoms. Gathering historical data requires repeated testing over time and assumes availability of true failure cause (TFC) information. An alternative approach is model-based. Model-based systems compare system responses to expected responses, based on computer models of the unit under test (UUT) or individual components. Model-based test systems are typically complex, often requiring manual entry of a substantial amount of data regarding design and structure. Models may require extensive data from experts or designers having specialized knowledge of the system to be tested. History-based test systems and complex model-based test systems are often impractical or are often not cost effective for UUT's that are in the prototype stage, UUT's that are undergoing frequent design revisions, UUT's that have short lifetimes, UUT's that are low-cost, or UUT's that are produced in limited quantities.
U.S. patent application Ser. No. 08/551,054 to Christopher Preist and David Allport (Preist et al), having the same assignee as the present application, discloses a model-based diagnostic system, based on functional tests, in which the modeling burden is greatly reduced. The model requires only a list of functional tests, a list of components exercised by each functional test along with the degree to which each component is exercised by each functional test, and (if available) the historical failure rate for individual components. The data to be entered may be rapidly and easily determined by test programmers or others familiar with, but not necessarily expert on, the UUT. Typically, the models may be developed by test programmers in a few days to a few weeks depending on the complexity of the UUT. The diagnostic system disclosed by Preist et al is particularly well suited to UUT's in the prototype stage, UUT's produced in limited quantities, UUT's with frequent design changes, UUT's having short life cycles, and UUT's where rapid turn on is important.
The diagnostic system disclosed by Preist et al is especially applicable to diagnosis of failures of electronic components on printed circuit boards. In general, however, the techniques apply to diagnosis of components in any system. For example, components may be printed circuit assemblies or other modules, components may be computers in a network, or components may be electromechanical components in an automobile or airplane. The general concepts are also applicable to medical diagnosis.
The present application deals with automated analysis of particular applications of the diagnostic system disclosed by Preist et al. Before stating the problems to be solved by the invention, a brief description of the diagnostic system is provided below. First, additional detail for some of the data bases is provided. Next, diagnosis is described, with two example methods of assigning weights for ranking candidate diagnoses. Then, the problems to be solved by the invention are described.
In this application, an operation is a process or action carried out by one or more functional tests. For example, a memory test may include a "read memory" operation and a "write memory" operation. Each operation exercises a specific set of components. In this application, the terms "coverage" or "utilization" may be used interchangeably to mean the extent to which a component is exercised by a particular test. Coverages may be specified either numerically (for example, as a percentage or as a fraction between 0 and 1) or categorically (for example, low, medium and high).
In the diagnostic system disclosed by Preist et al, the model comprises the following data structures:
(a) A data base of components and subcomponents. If components or subcomponents have had any prior testing or if failure rate data are available for components or subcomponents, these data are also included.

(b) A data base for mapping raw test results into categorical information (in the simplest case, pass/fail). For example, if the acceptable (pass) range for a particular voltage measurement is 4.5 V-5.5 V, a numerical test result of 4.0 V might map into a fail-low category.

(c) A functional test model in the form of a data base. Tests are defined as lists of operations. Each operation definition specifies each component or subcomponent exercised by the operation and an estimate of the coverage (degree exercised) of the component by the operation.

(d) Failure specifications, including indict lists and acquit lists. These allow the programmer to specify, if a particular failure occurs, a list of candidate components potentially responsible for the failure (indict list) and/or a list of components that must be functioning correctly to a certain degree (acquit list). For example, consider a functional test of an internal combustion automobile starting system. If the engine turns over at a specified RPM but does not start, the fuel system and ignition system are suspect (indict list) but the battery and starter motor must be good (acquit list).
As an example of data for the above data bases, consider functional testing of a printed circuit board. One test is a memory test. The memory test includes two operations: access_memory and output_to_busport. The printed circuit board includes the following components: a central processing unit (CPU), a random access memory system (RAM), a databus, an input/output port (port), and a display module. In addition, the random access memory system includes memory and a decoder as subcomponents. The access_memory operation exercises 10% of the functionality of the CPU, 90% of the functionality of the RAM decoder, and 10% of the functionality of the RAM memory. The output_to_busport operation exercises 50% of the functionality of the databus and 90% of the functionality of the port. The model then includes the following items:
______________________________________
components:            CPU
                       RAM
                       databus
                       port
                       display module
RAM includes
subcomponents:         memory
                       decoder
operations:            access_memory
                         CPU; .1
                         RAM, decoder; .9
                         RAM, memory; .1
                       output_to_busport
                         databus; .5
                         port; .9
______________________________________
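As an illustration only (the patent does not specify a data base format), the example model above can be sketched as plain Python data structures; all names here are ours:

```python
# A minimal sketch of the example functional test model.
# "RAM.decoder" and "RAM.memory" denote subcomponents of RAM.
model = {
    "components": ["CPU", "RAM", "databus", "port", "display module"],
    "subcomponents": {"RAM": ["memory", "decoder"]},
    # Tests are defined as lists of operations.
    "tests": {"memory_test": ["access_memory", "output_to_busport"]},
    # Each operation maps the components or subcomponents it exercises
    # to an estimated coverage (a fraction between 0 and 1).
    "operations": {
        "access_memory": {"CPU": 0.1, "RAM.decoder": 0.9, "RAM.memory": 0.1},
        "output_to_busport": {"databus": 0.5, "port": 0.9},
    },
}
```

Coverages could equally be stored categorically (low, medium, high), as noted above.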
The diagnostic system described above uses only data that are readily available at design time. Modeling does not require gathering of historical failure data of the system (although historical failure rates for individual components can be used if available). In the most simple embodiment, the diagnostic system merely requires entry of which components are tested by each operation. In an improved embodiment, the diagnostic system requires entry of the approximate degree to which each component is tested by each operation. This is data that can be provided by a designer or a test programmer. In particular, the diagnostic system does not require structural information (that is, data such as: the output of component A is connected to an input of component B), or failure models (that is, data such as: if test A fails then the most likely cause is component B) or behavioral models (that is, data such as: if both inputs to the NAND gate are high the output is low).
Once the model is defined, the functional tests are executed and failure data are collected. The diagnostic system then determines a diagnosis in three phases. In the first phase, a data abstraction module categorizes each test result as either passing or belonging to one of several possible failure categories. The data abstraction module is not relevant to the present application and is not described in detail here. In the second phase, given the failure results from the data abstraction module, candidate diagnoses are determined. A candidate diagnosis is a minimal set of components, which, if faulty, is capable of explaining all failing test results. Stated differently, every failing test must utilize at least one component in diagnosis D for D to be a candidate diagnosis. The method for determining candidate diagnoses is based on hitting sets, generally described for example in Reiter, R., "A Theory of Diagnosis from First Principles", Artificial Intelligence 32 (1987) 57-95. In the third phase, a relative weight or ranking is assigned to each of the candidate diagnoses. Two example methods of assigning weights are described below, but other types of evidential reasoning may also be used to rank possible diagnoses. Each of the methods described below has the advantage of being computationally efficient.
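The second phase can be sketched as a brute-force enumeration of minimal hitting sets; this is an illustrative sketch under our own naming, not the method actually used by Preist et al:

```python
from itertools import combinations

def candidate_diagnoses(failing_tests, coverage):
    """Enumerate minimal hitting sets: a candidate diagnosis is a minimal
    set of components such that every failing test utilizes at least one
    component in the set.

    failing_tests: list of test names.
    coverage: dict mapping a test name to the set of components it utilizes.
    """
    components = set().union(*(coverage[t] for t in failing_tests))
    diagnoses = []
    # Search by increasing size so that only minimal sets are kept.
    for size in range(1, len(components) + 1):
        for cand in combinations(sorted(components), size):
            s = set(cand)
            # A superset of an existing diagnosis is not minimal.
            if any(s >= d for d in diagnoses):
                continue
            # Every failing test must utilize some component in s.
            if all(s & coverage[t] for t in failing_tests):
                diagnoses.append(s)
    return diagnoses
```

For example, if failing test t1 utilizes {CPU, RAM} and failing test t2 utilizes {RAM, port}, the candidate diagnoses are {RAM} and {CPU, port}.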
In the first method for assigning weights, an assumption is made that the probability of a test failing, given that a particular component is known to be faulty, is proportional to the utilization of that component by that test, and an assumption is made that components fail independently. These assumptions are reasonable in many situations. For example, if all failures are point failures, and all point failures are equally likely, then utilization and the probability of failure are proportional. The assumptions lead to the following equation for assigning weights:

W(D,R) = p(D) * p(R|D) * (operation violation penalty) (Equation 1)
where:

D = {C_1, C_2, . . . C_M} is a candidate diagnosis (a set of components presumed faulty).

R = {R_1, R_2, . . . R_N} is a set of test results.

p(D) is the prior probability of the candidate diagnosis. That is, p(D) is the probability of the components involved in the candidate diagnosis failing given only that some tests have failed (that is, with no information regarding which tests have failed or in what manner). This information is optional and may be omitted if unknown.

p(R|D) is the posterior probability of getting the set of test results R if the candidate diagnosis D is the set of faulty components. This is calculated from the degree of utilization factors in the functional test model. If more than one failing test is involved, the relevant factors are multiplied together. That is, p(R|D) = p(R_1|D) * p(R_2|D) . . . .

(operation violation penalty) is a number between zero and one, used to reduce the weight of a diagnosis when an operation fails in one test, causing the entire test to fail, yet passes in another test. If an operation penalty is appropriate, the value may be set to a system constant used for all operation penalties, or variable penalties may be set for each operation. If no operation penalty is appropriate, the operation penalty is set to one (no penalty).
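Under the proportionality assumption, the Equation 1 weight can be sketched as follows; the function and parameter names are ours, and the per-test utilization factors are assumed to have already been derived from the functional test model:

```python
def weight_eq1(prior, fail_utilizations, penalty=1.0):
    """Sketch of Equation 1: W(D,R) = p(D) * p(R|D) * penalty.

    prior: p(D), the prior probability of the candidate diagnosis D.
    fail_utilizations: for each failing test, the utilization of D's
        components by that test, taken as proportional to p(R_i|D).
    penalty: the operation violation penalty (1.0 means no penalty).
    """
    p_r_given_d = 1.0
    # Factors for multiple failing tests are multiplied together.
    for u in fail_utilizations:
        p_r_given_d *= u
    return prior * p_r_given_d * penalty
```

For example, a diagnosis with prior 0.01 whose components are utilized 0.9 and 0.5 by two failing tests receives weight 0.01 * 0.9 * 0.5 = 0.0045.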
The method for assigning weights described above assumes that component failures are independent, assumes that test results are independent, and assumes that the degree of utilization of a component by a particular test is proportional to the probability of the test failing given that the component is faulty. Even if the assumptions are incorrect, the resulting computed rank order of relative probabilities may still be satisfactorily accurate for some applications. For further explanation of why the embodiment described above may provide satisfactory diagnoses even when the independence assumptions are not true, see Russek, E. "The Effect of Assuming Independence in Applying Bayes' Theorem to Risk Estimation and Classification in Diagnosis", Computers and Biomedical Research 16, 537-552 (1983).
In a second method for assigning weights, useful information can still be derived without making the assumption that the probability of a test failing given failure of a particular component is proportional to the utilization of that component by that test. In the second method, weights are computed as bounds on probabilities. A weight W is computed as follows:

W(D,R) = p(D) * minimum(alpha_1, alpha_2, . . . alpha_N) * (operation violation penalty) (Equation 2)

where alpha_i = (one minus the utilization of C_j by test i) when test i is a passing test and C_j is a member of D, or alpha_i = 1.0 when test i fails.
The "minimum" function in Equation 2 results from the fact that an upper bound on a logical AND of a set of probabilities is the minimum of the probabilities, or more generally, their least upper bound. The probability of all the passing tests passing, given a failure of a component in diagnosis D, is therefore bounded by the minimum of the probabilities of the individual tests passing (or the minimum of the upper bounds on those probabilities). An additional refinement is made when subcomponents are present. For example, assume that a component C consists of subcomponents A and B. If either A or B fails, then C fails. In estimating a weight for C, weights W_A for subcomponent A and W_B for subcomponent B are computed. A lower bound on the logical OR is the greatest lower bound of this set of weights. That is, the probability that either subcomponent A or subcomponent B or both fail is bounded by the maximum of the lower bounds on the probabilities of failure of the individual subcomponents. Accordingly, the appropriate weight for C is the maximum of W_A and W_B.
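The bound computations of the second method can be sketched as follows, assuming the per-test alpha values have already been derived from the model; the names are illustrative, not the patent's:

```python
def weight_eq2(prior, alphas, penalty=1.0):
    """Sketch of Equation 2: W(D,R) = p(D) * min(alphas) * penalty,
    where each alpha is 1 - utilization for a passing test and 1.0
    for a failing test. The minimum is the least upper bound on the
    logical AND of the passing-test probabilities.
    """
    return prior * min(alphas) * penalty

def composite_weight(subcomponent_weights):
    """Subcomponent refinement: a component fails if any subcomponent
    fails, so its weight is the maximum of the subcomponent weights
    (the greatest lower bound on the logical OR).
    """
    return max(subcomponent_weights)
```

For example, a diagnosis with prior 0.1 and alphas 1.0, 0.4, 0.7 receives weight 0.1 * 0.4 = 0.04, and a component whose subcomponents have weights 0.2 and 0.5 receives weight 0.5.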
In the diagnostic system disclosed by Preist et al and briefly described above, problems may occur in either the test suite or the model. For example, in the suite of functional tests, some components may not be exercised by any of the tests or may be only partially exercised. If no functional tests fail as a result of component failure or partial component failure, then the component failure will be undetectable. In addition, the diagnostic system may not be able to distinguish one component within a set of components or components may be inadequately distinguishable. Finally, inaccurate modeling data (for example, an incorrect estimate of the degree to which a component is exercised) may result in an incorrect diagnosis.
There is a need for further enhancement of the diagnostic system disclosed by Preist et al by providing automated analysis of the effectiveness of the test suite (ability of the model to detect and differentiate among potential faults), identification of possible test suite changes, and identification of possible modeling errors via automated analysis of incorrect diagnoses. The results of the analysis may then be used by test programmers to improve the particular application of the diagnostic system. In addition, there is a need for further enhancement of the performance of the diagnostic system disclosed by Preist et al by utilizing historical TFC data when such data are available.