Modern societies depend on the smooth and error-free operation of large and complex technological systems, such as telecommunication networks and power plants. When failures affect the operation of such large systems, it is important to be able to diagnose the ‘root cause’ of the observed problems. Consider, as an example, a telecommunication network that is used to transport the traffic of different applications. It is a complex inter-connection of many elements, and hence, can fail in many different ways. The failure of a single element, like a transmission link, a router, a server, or a database could affect many network-functions and thus give rise to a multitude of “alarms”, all correlated to the same failure. Similarly, since the successful operation of an application depends on many network elements, an “alarm” could have many different possible causes. Thus, in a complex system, many different symptoms could arise from the failure of a single element and many different element-failures can give rise to the same symptom.
The subject matter of the present inventions pertains to the class of fault diagnosis methods known as ‘model-based’, to denote the fact that they take as their starting point an analytical representation of the underlying Fault Propagation Model that specifies the causal relations between faults and symptoms in the system under consideration. A ‘bipartite graph’ is a convenient representation of the relationships in the Fault Propagation Model. In a bipartite graph there is a set of nodes, one for each object that could fail (and thereby become a ‘fault’), and another set of nodes, one for each symptom or alarm that can appear in the system. An object-node f is connected to a symptom-node s by a link if failure of object f (i.e., fault f) causes symptom s to be observed (in the case of deterministic causation) or if there is a non-zero probability that fault f causes symptom s to be observed (in the case of probabilistic causation). It is assumed that the probability $p_f$ of the occurrence of each fault f is known and that the occurrences of the different faults are all independent events. The representation of a Fault Propagation Model by a bipartite graph is well-established in the literature.
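As an illustration only, a deterministic Fault Propagation Model of this kind could be held in memory as a mapping from each object to the symptoms its failure causes. The sketch below is not taken from any cited method; the object and alarm names are hypothetical, and the helper `symptom_to_faults` is an assumed name introduced here to show the reverse direction of the bipartite graph.

```python
# Hypothetical toy model: each object (potential fault) maps to the set of
# symptoms that its failure causes (deterministic causation assumed).
fault_to_symptoms = {
    "link_A": {"alarm_1", "alarm_2"},
    "router_B": {"alarm_2", "alarm_3"},
    "server_C": {"alarm_3"},
}


def symptom_to_faults(model):
    """Invert the bipartite graph: symptom -> set of faults that can cause it."""
    inverse = {}
    for fault, symptoms in model.items():
        for symptom in symptoms:
            inverse.setdefault(symptom, set()).add(fault)
    return inverse


causes = symptom_to_faults(fault_to_symptoms)
```

In this toy model a single fault (`link_A`) raises two alarms, while a single alarm (`alarm_2`) has two possible causes, which is the many-to-many relationship the bipartite graph captures.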
The fault-diagnosis problem can be stated as follows: given that a set S of symptoms has been observed, determine the most probable set or sets of faults F whose occurrence would account for the observed symptoms S. If all faults are equally probable, the ‘most probable’ hypothesis is one that contains the smallest number of faults. If faults have different probabilities of occurrence, then the probability of occurrence of a given set of faults is the product of the probabilities of faults in the set and the product of the complement of the probabilities of faults not in the set.
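The probability described above can be written compactly as follows. The symbols $P(F)$ for the probability of a hypothesis and $\Phi$ for the set of all $N$ potential faults are notation assumed here for illustration; the independence of fault occurrences is the assumption already stated for the model.

```latex
P(F) \;=\; \prod_{f \in F} p_f \;\prod_{f \in \Phi \setminus F} \bigl(1 - p_f\bigr)

% Special case: all faults equally probable, p_f = p for every f
P(F) \;=\; p^{\,|F|}\,(1 - p)^{\,N - |F|}
```

In the equal-probability case this expression decreases as $|F|$ grows whenever $p < 1/2$, which is why the smallest consistent set of faults is then the most probable. For instance, with $N = 3$ and $p = 0.01$, a single-fault hypothesis has probability $0.01 \times 0.99^2 = 0.009801$, whereas a two-fault hypothesis has probability $0.01^2 \times 0.99 = 0.000099$.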
In the most general terms, the task is to determine which of the $2^N$ subsets of the $N$ objects are consistent with all the observed symptoms, and which among them have the highest probability of occurrence. Since the number of possible candidates for solution rises exponentially in $N$, the procedure of searching for a solution is not scalable, though, in practice, the effort might be reduced by the prior knowledge or assumption that there can be no more than $n \ll N$ simultaneous faults in the system (which limits the search to $\binom{N}{n}$ possibilities) or by special cases of the structure of the bipartite graph.
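A minimal sketch of such a bounded exhaustive search is given below, purely for illustration. It assumes deterministic causation, independent faults, and that a hypothesis is "consistent" when every observed symptom is caused by at least one fault in it; the function names, the toy model, and the probabilities are all hypothetical.

```python
from itertools import combinations
from math import prod


def hypothesis_probability(hypothesis, fault_probs):
    """Product of p_f for faults in the hypothesis and (1 - p_f) for the rest."""
    return prod(p if f in hypothesis else 1.0 - p for f, p in fault_probs.items())


def diagnose(model, fault_probs, observed, max_faults):
    """Enumerate every fault set of size <= max_faults that explains `observed`.

    Returns (probability, fault_set) pairs, most probable first.
    """
    ranked = []
    faults = sorted(model)
    for k in range(max_faults + 1):
        for combo in combinations(faults, k):
            # Union of the symptoms caused by the faults in this hypothesis.
            explained = set().union(*(model[f] for f in combo))
            if observed <= explained:
                ranked.append((hypothesis_probability(set(combo), fault_probs), set(combo)))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical toy data, not taken from any cited system.
model = {
    "link_A": {"alarm_1", "alarm_2"},
    "router_B": {"alarm_2", "alarm_3"},
    "server_C": {"alarm_3"},
}
fault_probs = {"link_A": 0.01, "router_B": 0.02, "server_C": 0.05}
ranked = diagnose(model, fault_probs, {"alarm_1", "alarm_3"}, max_faults=2)
```

Here no single fault explains both observed alarms, so the search falls through to the two-fault hypotheses. The number of combinations examined grows rapidly with both the number of objects and the bound on simultaneous faults, which is the scalability limit noted above.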
For example, in problems where the occurrence of multiple simultaneous faults is known, a priori, to be very rare, a method known as the “SMARTS Event Management System Codebook” relies on associating a unique ‘code’ of symptoms with each of the fault-occurrences chosen for consideration in the system. The method is described by S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo, in “A Coding Approach to Event Correlation”, Proceedings of the fourth international symposium on integrated network management, pp. 266-277, 1995, and in U.S. Pat. No. 5,661,668, entitled “Apparatus and Method for Analyzing and Correlating Events in a System using a Causality Matrix”, issued Aug. 26, 1997. Here, the bipartite graph of the fault-to-symptom mapping is expressed by an M×N matrix F of 1's and 0's, where M is the number of possible symptoms and N is the number of (independent) objects (which, upon failure, become faults), and the element $f_{ij}$ (in the deterministic case) is given by
$$f_{ij} = \begin{cases} 1 & \text{if symptom } i \text{ is present when fault } j \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$
Thus, column j of F, say $f_j$, is a vector of alarms that is viewed as a “codeword” for fault j. The “codewords” for the different faults must be distinguishable one from another; otherwise, there would be faults that produce identical alarm vectors, which must, hence, be regarded as “equivalent”. Instead of working with an entire column as a codeword, it is possible to work with a subset of the rows (symptoms) of F and still maintain the uniqueness of the codewords. On the assumption that there can be, at most, a single fault, in the absence of errors, the alarm vector either has all zeros or matches one of the codewords exactly. However, to guard against inexact matches due to erroneous or “lost” alarms, in selecting a subset of the symptoms to work with, one tries to produce codewords with a minimum pair-wise separation (Hamming distance) so that an alarm vector, when it fails to match any codeword exactly, can be assigned to the codeword to which it is closest.
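The nearest-codeword assignment can be sketched as follows. This is an illustrative example under the single-fault assumption, not the implementation of the cited codebook method; the codebook contents and the function names `hamming`, `min_distance`, and `decode` are hypothetical.

```python
from itertools import combinations

# Hypothetical codebook: each fault's codeword is its column of the
# causality matrix, restricted to five chosen symptoms.
codebook = {
    "fault_1": (1, 1, 0, 0, 1),
    "fault_2": (0, 1, 1, 0, 0),
    "fault_3": (0, 0, 0, 1, 1),
}


def hamming(a, b):
    """Number of positions in which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(book):
    """Smallest pair-wise Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(book.values(), 2))


def decode(alarm_vector, book):
    """Return the fault whose codeword is closest to the observed alarms.

    An all-zero alarm vector is treated as "no fault" and yields None.
    Ties are broken by codebook order, which is an arbitrary choice here.
    """
    if not any(alarm_vector):
        return None
    return min(book, key=lambda fault: hamming(book[fault], alarm_vector))


# fault_1 has occurred, but its second alarm was lost in transit.
diagnosed = decode((1, 0, 0, 0, 1), codebook)
```

With a minimum pair-wise distance of 3, as in this toy codebook, any single erroneous or lost alarm still leaves the observed vector closer to the correct codeword than to any other.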
M. Steinder and A. S. Sethi, in “Probabilistic fault diagnosis in communication systems through incremental hypothesis updating”, Computer Networks 45, pp. 537-562, 2004, consider the diagnostic problem for the case when the coupling between objects and symptoms in the bipartite graph is allowed to be probabilistic, and present a Bayesian inference algorithm in which certain approximations are used to limit the number of computations for finding a solution.
As noted earlier, without assumptions that limit the number of possible simultaneous faults, the number of hypotheses to be considered in diagnosing the root cause of a set of observed symptoms grows exponentially in the number of potential faults (objects). This rate of growth in complexity limits the size of the problems that can be solved by means of direct, centralized computation. An approach to slowing the rate of growth of complexity of diagnostic calculations is to partition the problem in some fashion into a number of ‘computational domains’ such that the calculations for the sub-problem in each domain can be carried out in parallel, i.e., centralized computation is replaced with distributed computation in the domains. Some coordination might then be needed among the results from the domains in order to arrive at a solution to the overall problem.
U.S. Pat. No. 6,868,367, entitled “Apparatus and Method for Event Correlation and Problem Reporting”, issued Mar. 15, 2005, describes the case of multiple domains, with the assumption that, in each domain, it is very rare to have more than one fault. The diagnostic method appears to consist of a ‘pooling’ of the solutions of the local domains. Other methods for coordinating such distributed computations, based on an exchange (either one-shot or iterative) of ‘cost’ information among the domains, have been proposed by A. T. Bouloutas, S. B. Calo, A. Finkel, and I. Katzela in “Distributed Fault Identification in Telecommunications Networks”, Journal of Network and Systems Management, 1995; and by M. Steinder and A. S. Sethi, in “Multi-domain diagnosis of end-to-end service failures in hierarchically routed networks”, IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 3, pp. 379-392, March 2007.