Distributed on-line diagnosis methods (algorithms) are known. For example, in one such system each node of a distributed system is capable of diagnosing the state of all system resources based on locally maintained information. This method operates correctly in the presence of dynamically occurring fault events. However, its high overhead makes it prohibitive to implement in practical systems. The overhead includes inter-node testing and the messages required to distribute diagnosis information.
Adaptive testing methods have addressed the cost of the redundant tests that a fixed testing assignment requires in order to accommodate multiple faults. One adaptive testing method is executed by a central observer and issues only those tests required for diagnosis. In another method, distributed adaptive testing was devised in which testing decisions are made locally by the nodes of a distributed network. The former method executes off-line, requiring that no fault events occur during algorithm execution, and the latter method requires a fully connected network. The latter method requires the minimum overhead to perform the system-level diagnosis task.
The latter adaptive method has been implemented in a network of over 200 workstations at Carnegie Mellon University. By distributing its execution to the fault-free workstations, it has executed continuously for over 1.5 years, even though no single workstation was fault-free for the entire period. See U.S. Pat. No. 5,325,518, assigned to the assignee of the present invention.
Recently, a method was presented for on-line execution in networks of arbitrary topology. Additionally, considerable work has been done on other distributed methods that can be applied to distributed diagnosis, including leader election. Leader election algorithms are based on distributed spanning tree construction, and that work has resulted in several algorithms with lower complexity. However, these algorithms require a stable network environment during execution and are thus not directly applicable to on-line diagnosis.
Accordingly, it is an objective of the present invention to provide on-line adaptive distributed diagnosis in arbitrary networks in the presence of both node and link failures. It is a further objective of the present invention to provide a diagnostic method that has lower overhead and better execution bounds.