This invention relates to a method for automatically isolating item faults in a system and, more particularly, to a method for automatically isolating item faults by using matrices to identify which items have failed.
Fault isolation techniques and related methods of automatically detecting faults (such as for line replaceable units or least replaceable units (LRU)) are becoming increasingly important to maintain complicated apparatus and communications equipment in proper operating order. One of the drawbacks of conventional fault isolation techniques is the use of software that is specific to a particular system, such as a communication system. For example, in the prior art fault isolation technique labeled FIG. 1, four line replaceable units are illustrated at 20a-d as part of the equipment forming a general communication system indicated by the dotted line at 22. The line replaceable units 20a-d output to respective line replaceable units as illustrated. The prior art communications equipment includes a sensor 24 that would sense a respective line replaceable unit 20d, and means to conduct a subtest and generate performance data for an automatic fault isolation software or hardware module 26, which would not be part of the communication equipment. A software module would include the appropriate processor having specific software with numerous software branches that depend on hardware and system specific algorithms.
The sensor 24 typically checks the status of equipment and outputs performance data. Often, a line replaceable unit 20 may generate its own status bit, such as an “on” or “off” bit, which indicates a problem has occurred in a line replaceable unit. Any automatic fault isolation software typically is a complicated software code with many branches. It is costly to develop, test and maintain. Additionally, this complicated software is difficult to reuse because the specific case branches depend on the specific hardware, which naturally is different in each system. In the prior art process, the automatic fault isolation software or hardware module 26 identifies the failed line replaceable unit, and then outputs a status report to an operator.
FIG. 2 illustrates a basic problem showing why automatic fault isolation techniques are difficult. Individual line replaceable units are shown as hardware modules, and given reference numerals 30A-30H. For example, both hardware modules 30A and 30B are required to provide necessary signal inputs to the other modules 30C, 30D and 30E, which in turn are directly responsible for providing input signals to respective hardware modules 30F, 30G and 30H. Thus, a failure of either module 30A or module 30B can cause the observed test failures. FIG. 3 shows a diagram similar to FIG. 2, but having module 30E input signals to modules 30H and 30I. It is evident that one hardware module failure can make several subtests fail, and different hardware module failures can cause similar patterns of subtest failures.
Many prior art devices and software systems have been developed to aid in diagnostic testing of different hardware, using status checks and software testing. Examples include the systems and apparatus disclosed in the following U.S. Patents.
U.S. Pat. No. 5,652,754 to Pizzica discloses a signature analysis usage limited to 100% digital systems for fault isolation. It derives a digital signature by a sequence of binary operations on the digital outputs of the module, in response to a sequence of digital inputs to the module. It does not consider any other electrical or physical variables. It is disclosed for single electrical node short or open failures, and other node failure types without substantiation, but does not disclose multiple node failures or failures internal to components. This system typically cannot be applied during normal system operation, and relies on an external test input signal source for a predetermined sequence of input stages. It requires disconnection of module inputs from their operational configuration, and connection instead to the external test input signal source. It looks for exact matches to predetermined signatures derived from digital module outputs, rather than partial matches to lists of symptoms caused by component failures. It makes no attempt to find an explanation for the greatest number of observed symptoms.
U.S. Pat. No. 3,813,647 to Loo discloses a system that does a window comparison on a sequence of analog or pulse rate test points, with branches in the sequence depending on any values out of tolerance. It looks first at high level summary indications, and then in more detail at a subset of equipment where a high level fault was found. It also includes signal generators to provide synchronized stimuli to the system under test during the sequence. It seems to flag only a single fault, the first that it arrives at in the preprogrammed sequence. This method requires stimulating the system under test with predetermined inputs from signal generators or equivalent.
U.S. Pat. No. 3,787,670 to Nelson discloses a system that considers various physical variables such as punched card movement, voltage, temperature, pressure, etc. to be converted to digital signals using an A/D converter. A major feature is the use of photo cells for sensing punched cards. Performance of various subunits would be measured and monitored relative to correct performance. The system compares quantized digital values to reference values, but without specifying high and low limits, and a central processor is used for the comparisons and decisions. The system identifies the location in the equipment where the bad data value occurred.
U.S. Pat. No. 4,142,243 to Bishop et al. discloses a system for fault detection using checksums on a digital computer. The system checks for faults first in large sets, then checks in subsets contained in the large set where a fault occurred. The system assumes a simple one to one correspondence of checksum memory locations and hardware fault locations, and a computer under test to follow a predetermined sequence of digital states so that measured and pre-stored checksums match. The technique does not apply during normal online operation.
U.S. Pat. No. 5,157,668 to Buenzli, Jr. et al. discloses a fault isolation system that uses reasoning techniques, minimal test patterns, component modeling, and search strategies. Modeling uses “behavioral constraints” which include “phase constraints,” gain constraints, compliances, tolerances, and saturation. “Phase constraints” equals IF inputs change, and outputs change in the plus/minus direction. The system uses a recursive top-down hierarchical search strategy starting with the highest level modules, and isolates low-level circuit components. The system also uses a library of circuit component models, and starts with normal functional tests, requiring a detailed functional test plan. It also assumes that an operator already knows which module is tested by each functional test. The system also assumes that electrical schematics, CAD models, CAE models, and layout diagrams are available, and typically it would be difficult to apply during normal system operation. The system relies on an external test input signal source for a predetermined sequence of input states, and disconnects module inputs from their operational configuration, and connects them instead to the external test input signal source.
U.S. Pat. No. 5,581,694 to Iverson discloses a fault isolation system that receives data samples, indicating the operating status of units in a system, and analyzes this data to identify unit failures, using a knowledge base (in a data structure) about the equipment. The knowledge base describes signal flow through the system, and the technique considers failure propagation along this signal flow.
It is therefore an object of the present invention to provide a method for automatically isolating item faults that can use standard fault detection status inputs and subtests and work both “on” and “off” line for either start-up or operational automatic fault isolation.
It is still another object of the present invention to provide a method for automatically isolating item faults that can identify all potentially failed items and use software that does not have many program branches.
The present invention is advantageous because it provides a method and software implementation for automatic fault isolation that uses data matrices that can identify which item failures cause which status input and subtest failures. The present invention also identifies specific subtest successes to prove that specific items are operative, and identifies in ranked order any items whose failure can explain the greatest number of subtest failures. The present invention breaks ties using expected failure rate data or other criteria for which suspect item should be replaced first, such as ease of replacement, availability of spare modules and other details.
The method and software implementation of the present invention is also advantageous because it can identify all potentially failed items, and it can be implemented in simplified software or a simplified circuit. Although the hardware to be fault isolated is different each time, the software of the present invention does not have to be unique to the hardware or rewritten each time. Thus, the software does not have to be lengthy and complicated, with special case logic for each item.
The present invention is also advantageous because it separates the system-specific data into a simple tabular data structure, making the resulting actual software code short, generic, maintainable and reusable. The claimed invention can be used to provide automatic fault isolation in cars, airplanes, computers, communication systems, machinery and many other different products known to those skilled in the art. In accordance with the present invention, a method of automatically isolating item faults includes the steps of obtaining subtest results T[i] for a plurality of items, where T[] comprises a vector of m subtest results, with T[i] > 0 if subtest i fails and T[i] ≤ 0 if subtest i does not fail. The method also comprises processing the subtest results T[i] with respective matrix values Y[i,j], where Y[] comprises a predetermined m×n matrix, with Y[i,j] > 0 if item j causes subtest i to fail, and Y[i,j] ≤ 0 if item j does not cause subtest i to fail. The results are summed in order to obtain a number S[j] for each item that is reflective of the number of subtest failures that are explained by a failure of item j, wherein any item j is suspect if S[j] is greater than 0. S[j] reflects the degree of matching or correlation of the actual failure pattern with the expected failure pattern for item j.
In accordance with the present invention, the method also comprises the step of determining which item has the most likely failure by determining the largest value of S[j] when S[j] has multiple non-zero entries. The largest value of S[j] is indicative of the most likely single item failure, where any other items with non-zero S[j] represent alternate possible item failures. Subtest results T[i] and matrix values Y[i,j] are correlated, compared, vector multiplied and summed, or vector ANDed and summed.
In accordance with still another aspect of the present invention, a probability factor FP[j] is compared when there are two or more numbers S[j] that have substantially the same largest value to determine which item has the most likely failure. The probability factor FP[j] comprises estimated failure rates or the ease of replacement of respective items. A raw number Sraw[j] reflective of all observed subtest failures is initially obtained and processed with variables SumTN[] that rule out certain item failures to obtain S[j].
The method obtains subtest status Trun[] for a plurality of n items where Trun[] comprises a vector of m status values and Trun[i]=1 if the result of subtest i is known, and Trun[i]=0 if the result of subtest i is unknown.
The method first correlates or compares the subtest results T[] with matrix values Y[], where Y[] comprises a predetermined m×n matrix where Y[i,j] > 0 if failure of item j can cause subtest i to fail and Y[i,j] ≤ 0 if failure of item j does not cause subtest i to fail.
For each i, the method can set a value Tmod[i]=a positive value if Trun[i]=0, and Tmod[i]=T[i] if Trun[i]=1, where Tmod[] comprises a vector of m subtest results and Tmod[i]=the degree to which subtest i fails or passes, with Tmod[i] > 0 for failure and Tmod[i] ≤ 0 for success.
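The handling of unknown subtest results described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `modified_results`, the sample data, and the choice of 1 as the positive placeholder are assumptions made for the example.

```python
def modified_results(T, Trun, unknown_value=1):
    """Build Tmod[]: keep T[i] when the result is known (Trun[i] == 1),
    otherwise substitute a positive placeholder.

    unknown_value=1 is an assumed choice; the text only requires a positive value.
    """
    return [T[i] if Trun[i] == 1 else unknown_value for i in range(len(T))]


# Hypothetical data: subtest 2 was never run, so its result is unknown.
T = [1, 0, 0]
Trun = [1, 1, 0]
Tmod = modified_results(T, Trun)
print(Tmod)  # [1, 0, 1]
```

Because an unknown result is given a positive value, it counts like a failure in the later sums, so a subtest that was never run cannot by itself show that an item is working.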
The method can set each value SumN[j]=a positive value if column or row j of matrix N corresponding to item j has any nonzero entries, and SumN[j]=0 if column or row j of matrix N has no nonzero entries, where SumN[j]=a positive value if SumTN can be used to prove that item j is ok, and SumN[j]=0 if SumTN cannot be used to prove that item j is ok.
The method can also correlate or compare second matrix values N[i,j] with the subtest results Tmod[i], where N[] comprises a predetermined m×n matrix where N[i,j] > 0 if failure of item j causes subtest i to fail and N[i,j]=0 if failure of item j does not cause subtest i to fail.
The method can obtain from the second correlation or comparison a value SumTN[j] for each item that reflects a weighted number of subtest failures that are explained by a failure of item j, or the degree of similarity of T[] to the minimum failure pattern expected from a failure of item j, wherein any item j is not suspect if SumN[j] > 0 and SumTN[j] is less than a positive threshold.
The method can set each value S[j]=0 if SumN[j] > 0 and SumTN[j] is less than a positive threshold, or otherwise S[j]=Sraw[j], where S[j] for each item j reflects the number of subtest failures that are explained by a failure of item j, or the degree of similarity of T[] to the maximum failure pattern expected from a failure of item j, wherein any item j is suspect if S[j] > 0.
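The ruling-out step can be sketched as below. It is a simplified illustration under stated assumptions: the function name `rule_out`, the sample matrices, and the default threshold of 1 are chosen for the example, and N is stored with one row per subtest and one column per item.

```python
def rule_out(S_raw, N, T_mod, threshold=1):
    """Return S[]: zero for items proved ok, otherwise the raw score.

    An item j is cleared when column j of N has a nonzero entry (SumN[j] > 0)
    and SumTN[j] = sum_i N[i][j] * T_mod[i] is below the threshold.
    threshold=1 is an assumed value; the text only requires it to be positive.
    """
    m, n = len(N), len(N[0])
    S = []
    for j in range(n):
        sum_n = 1 if any(N[i][j] != 0 for i in range(m)) else 0
        sum_tn = sum(N[i][j] * T_mod[i] for i in range(m))
        cleared = sum_n > 0 and sum_tn < threshold
        S.append(0 if cleared else S_raw[j])
    return S


# Hypothetical 3-subtest, 3-item case: a pass of subtest 2 proves item 1 is ok.
N = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
]
T_mod = [1, 1, 0]      # subtests 0 and 1 failed, subtest 2 passed
S_raw = [2, 1, 0]      # raw scores computed beforehand from Y and T
print(rule_out(S_raw, N, T_mod))  # [2, 0, 0]
```

Item 0 keeps its raw score because no entry of N refers to it, so nothing can clear it; item 1 is cleared because its proving subtest passed.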
The method can obtain a predetermined probability factor FP[j] for each item j, where FP reflects relative item failure rates, or ease of item repair or replacement, or other criteria for breaking ties in values of S[] and deciding which item to replace first.
The method can determine which item x has the most likely primary failure by determining the largest value of S[j] when S[j] has multiple nonzero entries, and breaking ties based on the largest value of FP[], such that the largest value of S[j], or S[j] and FP[j] for ties, is indicative of the most likely single primary item failure where any other items with S[j] above a threshold represent alternate possible secondary item failures.
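One way to sketch the ranking and tie-breaking is shown below. The function name `rank_suspects` and the sample values are hypothetical, and ties are detected by exact equality of S[j], which is a simplification of "substantially the same" values.

```python
def rank_suspects(S, FP):
    """Return suspect item indices, most likely first.

    Items are ordered by S[j] descending; equal S[j] values are ordered
    by FP[j] descending. Items with S[j] <= 0 are not suspects.
    """
    suspects = [j for j in range(len(S)) if S[j] > 0]
    return sorted(suspects, key=lambda j: (S[j], FP[j]), reverse=True)


# Hypothetical scores and tie-break factors for four items.
S = [2, 2, 1, 0]
FP = [0.1, 0.3, 0.9, 0.5]

ranking = rank_suspects(S, FP)
print(ranking)      # [1, 0, 2]
print(ranking[0])   # 1 -> most likely primary failure
```

Items 0 and 1 tie on S, so the larger FP value puts item 1 first; item 2 remains an alternate suspect even though its FP value is the highest, because S is compared before FP.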
The method can obtain from Y[] the column or row Yx[] corresponding to item x, and set T[i]=0 for all i where Yx[i] > 0, and repeat all steps up to this point until all entries of T[]=0, where more than one item may be identified as a primary failure.
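The repeat-until-explained loop can be sketched as follows. This is a reduced illustration: the function name `isolate_primaries` and the data are hypothetical, values in Y and T are assumed to be 0 or 1, and the ruling-out step with matrix N is omitted for brevity.

```python
def isolate_primaries(Y, T, FP):
    """Repeatedly pick the best-scoring item and clear the failures it explains.

    Y is m x n (rows: subtests, columns: items); T holds m subtest results.
    Assumes 0/1 entries. Returns the primary failures in the order found.
    """
    m, n = len(Y), len(Y[0])
    T = list(T)  # work on a copy so the caller's results are untouched
    primaries = []
    while any(t > 0 for t in T):
        S = [sum(Y[i][j] * T[i] for i in range(m)) for j in range(n)]
        if max(S) <= 0:
            break  # remaining failures are not explained by any item
        x = max(range(n), key=lambda j: (S[j], FP[j]))
        primaries.append(x)
        for i in range(m):
            if Y[i][x] > 0:
                T[i] = 0  # these failures are now accounted for by item x
    return primaries


# Hypothetical system: item 0 affects subtests 0 and 1, item 1 affects
# subtest 1, item 2 affects subtest 2.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
FP = [0.2, 0.5, 0.3]
print(isolate_primaries(Y, [1, 1, 1], FP))  # [0, 2]
```

In this run item 0 explains subtests 0 and 1, and the remaining failure of subtest 2 is then attributed to item 2, so two primary failures are reported.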
The system analyst can predict by analysis and verify by experiment which item failures may cause which subtests to fail. Let m be the number of subtests and n be the number of items. Let Y[] be a predetermined m×n matrix where Y[i,j] > 0 if item j can cause subtest i to fail, and Y[i,j] ≤ 0 otherwise. For purposes of description, the first value will be described throughout as a 1. The second value will be described as a 0. Let T[] be a vector of m subtest results where T[i] > 0 if subtest i fails, and T[i] ≤ 0 otherwise. Then if subtest i fails, any item j for which Y[i,j]=1 is suspect. In general,
$$S[j] = \sum_{i=1}^{m} Y[i,j] \cdot T[i]$$
equals the number of observed subtest failures that are explained by a failure of item j. Any item j is suspect if S[j] > 0.
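The sum for S[j] can be illustrated with a short sketch. The function name `suspect_scores` and the matrix values are hypothetical, and the entries use the 1 and 0 convention adopted above.

```python
def suspect_scores(Y, T):
    """Return S[j] = sum over i of Y[i][j] * T[i] for an m x n matrix Y."""
    m, n = len(Y), len(Y[0])
    return [sum(Y[i][j] * T[i] for i in range(m)) for j in range(n)]


# Hypothetical 3-subtest, 3-item system (illustrative values only).
Y = [
    [1, 0, 0],  # subtest 0 fails if item 0 fails
    [1, 1, 0],  # subtest 1 fails if item 0 or item 1 fails
    [0, 0, 1],  # subtest 2 fails if item 2 fails
]
T = [1, 1, 0]   # subtests 0 and 1 failed, subtest 2 passed

S = suspect_scores(Y, T)
suspects = [j for j, s in enumerate(S) if s > 0]
print(S)         # [2, 1, 0]
print(suspects)  # [0, 1]
```

Item 0 explains both observed failures and item 1 explains one, so both are suspect, while item 2 explains none.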
Majority Voting
By using fractional values in N and comparing the result to a non-zero threshold, a single matrix N can provide: (1) single subtests that can each prove an item is okay; (2) groups of subtests that together can prove an item is okay; and (3) a majority vote of subtests that can prove an item is okay or suspect.
To handle cases where items can be proved acceptable if a combination of subtests all passes, let N[] be a predetermined m×n matrix where N[i,j]=1 if subtest i is part of a set of subtests that together can prove j is acceptable, and 0 otherwise. Then
$$\mathrm{SumN}[j] = \sum_{i=1}^{m} N[i,j] > 0$$
if some set of subtests can prove item j is acceptable, and if
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] < 1,$$
then all subtests in this set pass and item j is acceptable. If item j is acceptable, the system sets S[j]=0. If
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] \geq 1,$$
then at least one subtest in the set failed so the item is suspect.
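A small sketch of the all-must-pass case follows. The helper name `sum_tn` and the single-item matrix are hypothetical, and subtest results use 1 for failure and 0 for success.

```python
def sum_tn(N, T, j):
    """SumTN[j] = sum over i of N[i][j] * T[i]."""
    return sum(N[i][j] * T[i] for i in range(len(T)))


# Hypothetical single-item system: subtests 0 and 1 together prove item 0
# is acceptable; subtest 2 is not part of the proving set.
N = [[1], [1], [0]]

all_pass = [0, 0, 1]   # both proving subtests pass (subtest 2 fails but has N = 0)
one_fails = [0, 1, 1]  # proving subtest 1 fails

print(sum_tn(N, all_pass, 0) < 1)   # True  -> item 0 is acceptable, S[0] = 0
print(sum_tn(N, one_fails, 0) < 1)  # False -> item 0 stays suspect
```

With a weight of 1 on every subtest in the set, a single failure is enough to reach the threshold of 1, which is why the whole set must pass before the item is cleared.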
To handle cases where items can be proved acceptable if any one of a set of P subtests passes, let N[i,j]=1/P if subtest i alone can prove item j is okay, and 0 otherwise. Then if
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] < 1,$$
at least one subtest in this set passed and module j is acceptable. If module j is acceptable, the system sets S[j]=0. If
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] = P/P = 1,$$
then all of the P subtests in the set failed so the module is suspect.
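The any-one-passes case can be sketched with P = 3. The helper name `sum_tn` and the data are hypothetical; exact fractions are an implementation choice made for this example, not something the text prescribes.

```python
from fractions import Fraction


def sum_tn(N, T, j):
    """SumTN[j] = sum over i of N[i][j] * T[i]."""
    return sum(N[i][j] * T[i] for i in range(len(T)))


# Hypothetical single-item system: any one of P = 3 subtests passing
# proves item 0 is acceptable, so each entry is 1/P.
P = 3
w = Fraction(1, P)
N = [[w], [w], [w]]

one_passes = [1, 1, 0]  # two subtests fail, one passes
all_fail = [1, 1, 1]    # every subtest in the set fails

print(sum_tn(N, one_passes, 0) < 1)  # True  -> item 0 is acceptable
print(sum_tn(N, all_fail, 0) < 1)    # False -> sum equals P/P = 1, item is suspect
```

Exact fractions are used here because binary floating-point sums of 1/P can land slightly below 1 (adding 0.1 ten times gives 0.9999999999999999), which would wrongly clear an item when every subtest has failed. An implementation that uses floating-point values could instead compare against a threshold slightly below 1.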
A fractional value of N[i,j] also provides the majority voting capability. This is useful where multiple devices connect to and depend on the same signal, such as a power supply or serial port. If a majority of the devices are working, the power supply or serial port must be acceptable. But if a majority of the devices show power-related or communication faults, the power supply or serial port is suspect. The threshold can also be set to an application-dependent value rather than a simple majority. In either case, let N[i,j]=1/P if fewer than P subtest failures prove that module j is acceptable, and 0 otherwise. Then if
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] < 1,$$
fewer than P subtest failures have occurred and module j is acceptable. If
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i] \geq 1,$$
at least P subtest failures have occurred and module j is suspect. Because the devices that depend on the same power supply or serial port may each have several subtests, the value 1/P may occur in N[j] more than P times and
$$\mathrm{SumTN}[j] = \sum_{i=1}^{m} N[i,j] \cdot T[i]$$
may be greater than 1. For complex cases, such as a multi-output power supply where P items use one output and Q items use a different output, some N[i,j]=1/P while others equal 1/Q.
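Majority voting can be sketched for a hypothetical power supply feeding five devices, each with one subtest, where three failures (a simple majority) make the supply suspect. The function name `supply_is_suspect`, the device count, and the use of exact fractions are assumptions made for the example.

```python
from fractions import Fraction

DEVICES = 5  # hypothetical number of subtests that depend on the supply
P = 3        # failures needed before the supply is suspect (simple majority)

# Column of N for the power supply: 1/P on every dependent subtest.
N_supply = [Fraction(1, P)] * DEVICES


def supply_is_suspect(T):
    """Return True when SumTN = sum_i N[i] * T[i] reaches the threshold of 1."""
    sum_tn = sum(n_i * t_i for n_i, t_i in zip(N_supply, T))
    return sum_tn >= 1


print(supply_is_suspect([1, 1, 0, 0, 0]))  # False: 2/3 < 1, a majority still works
print(supply_is_suspect([1, 1, 1, 0, 0]))  # True:  3/3 = 1
print(supply_is_suspect([1, 1, 1, 1, 1]))  # True:  5/3 > 1
```

The last case shows the situation noted above, in which 1/P appears more than P times in the column and the sum exceeds 1.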