There has been significant theoretical research in the area of system-level diagnosis. Necessary conditions for system-level diagnosability were given in 1967 by Preparata, Metze and Chien; F. P. Preparata, G. Metze and R. T. Chien. On the Connection Assignment Problem of Diagnosable Systems. IEEE Transactions on Electronic Computers EC-16 (12):848-854, December, 1967, and diagnosable systems were characterized in 1974 by Hakimi and Amin; S. L. Hakimi and A. T. Amin. Characterization of Connection Assignment of Diagnosable Systems. IEEE Transactions on Computers C-23(1), January, 1974. Since that time, a large body of further theoretical work has appeared; E. Kreutzer and S. L. Hakimi. System-Level Fault Diagnosis: A Survey. Euromicro Journal 20(4,5):323-330, May 1987, including work on the diagnosability of new failure modes; C. L. Yang and G. M. Masson. Hybrid Fault Diagnosability with Unreliable Communication Links. In Fault-Tolerant Computing Systems, pages 226-231. IEEE, July, 1986, and on the development of diagnosis algorithms; S. H. Hosseini, J. G. Kuhl and S. M. Reddy. A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair. IEEE Transactions on Computers C-33(3):223-233, March, 1984. Recently, a distributed diagnosis algorithm has been implemented and presented in R. P. Bianchini Jr., K. Goodwin and D. S. Nydick. Practical Application and Implementation of Distributed System-Level Diagnosis Theory. In Proceedings of the Twentieth International Symposium on Fault-Tolerant Computing, pages 332-339. IEEE, June, 1990.
The present invention involves a new distributed diagnosis algorithm, Adaptive DSD, and its implementation. The framework of Adaptive DSD is modeled after the NEW_SELF distributed self-diagnosable algorithm given by Hosseini, Kuhl and Reddy; S. H. Hosseini, J. G. Kuhl and S. M. Reddy. A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair. IEEE Transactions on Computers C-33(3):223-233, March, 1984. In that work it is assumed that a node is capable of testing a fixed set of neighboring nodes. It is further assumed that fault-free nodes pass on the results of these tests to other nodes in the network. No assumption is made about faulty nodes, which may distribute erroneous test results. Diagnostic messages containing test results flow between neighboring nodes and reach nonneighboring nodes through intermediate nodes. Each node determines an independent diagnosis of the network from the diagnostic messages it receives. The NEW_SELF algorithm was extended in R. P. Bianchini Jr., K. Goodwin and D. S. Nydick. Practical Application and Implementation of Distributed System-Level Diagnosis Theory. In Proceedings of the Twentieth International Symposium on Fault-Tolerant Computing, pages 332-339. IEEE, June, 1990, by addressing the resource limitations of actual distributed systems. This new algorithm, called EVENT_SELF, utilizes "event driven" diagnostic messages to reduce the resource overhead of the NEW_SELF algorithm.
The Adaptive DSD algorithm differs considerably from the SELF algorithms in that the testing structure is adaptive and determined by the fault situation. The algorithm handles node failures and repairs. Link failures are not considered in this implementation. The Adaptive DSD algorithm also differs from the SELF algorithms in that the number of nodes in the fault set is not bounded. The SELF algorithms bound the number of allowable faulty nodes to a predefined limit, t. In the Adaptive DSD algorithm, the fault set can include any number of nodes, up to all but one node in the network. In that case, the remaining fault-free node will correctly diagnose all the other nodes as faulty.
The algorithm is optimal in terms of the total number of tests required. For correct diagnosis, each node must be tested by at least one fault-free node. In the Adaptive DSD algorithm each node is tested by exactly one fault-free node. Each node typically tests one other node, but can be required to test multiple nodes, of which one must be fault-free. In addition, the algorithm requires the maintenance of less complex data structures than the SELF algorithms. All diagnostic information is contained in a single data structure stored at each node.