Diagnosing the causes of network faults is the first step in their correction, either by restoring the network to full operation or by reconfiguring the network to mitigate the impact of a fault. Fault diagnosis is relatively straightforward if one has information about the state of all active elements in the network. Unfortunately, such monitoring information is usually transmitted through the network itself to a network management system (NMS) that processes the information for diagnosis. Thus, a fault can impede the gathering of information about itself and about other faults in the network. This problem can be reduced, but not solved, by distributing NMS functionality through the network. Distributing fault diagnosis among multiple NMSs provides multiple perspectives, but coordination among the NMSs can itself be disrupted by the same network faults they are trying to diagnose and repair. Distributing the NMS capabilities nevertheless allows the various domains of the network to be managed autonomously, thus avoiding a single point of failure.
The most common approach to network fault diagnosis is known as the fault propagation approach, which leverages a model of how faults propagate through a network. For example, the failure of a network interface can effectively sever communication to its entire device, thus creating the appearance of secondary faults in the device. A fault propagation model of a complete network is often constructed from the modeled behavior of network elements and the network layout. Once constructed, the propagation model is used during live fault diagnosis to reason about the network monitoring data that are available to an NMS, gathered through SNMP queries and traps, periodic polling, and other mechanisms that allow monitoring of the state of various network elements. The fault is then localized based on the results of reasoning over the fault propagation model.
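As an illustration of how a fault propagation model can be derived from the network layout, the following minimal Python sketch predicts which elements appear faulty to an NMS when a given set of elements has actually failed. The topology, the element names (nms, r1, h1, and so on), and the function apparent_faults are hypothetical and chosen only for this example; the sketch assumes that an element appears faulty whenever the observer cannot reach it through non-failed elements.

```python
from collections import deque

# Hypothetical network layout: each element maps to its directly connected
# neighbors. "nms" is the observer (the network management system).
LINKS = {
    "nms": ["r1"],
    "r1": ["nms", "r2", "r3"],
    "r2": ["r1", "h1"],
    "r3": ["r1", "h2"],
    "h1": ["r2"],
    "h2": ["r3"],
}


def apparent_faults(links, failed, observer="nms"):
    """Return the elements that appear faulty to the observer.

    Assumption for this sketch: an element appears faulty if it has failed
    or if every path to it from the observer crosses a failed element.
    """
    failed = set(failed)
    reachable = {observer}
    queue = deque([observer])
    # Breadth-first search that never enters a failed element.
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in reachable and neighbor not in failed:
                reachable.add(neighbor)
                queue.append(neighbor)
    return set(links) - reachable


# A single failure at r2 also makes h1 look faulty (a secondary symptom).
print(sorted(apparent_faults(LINKS, {"r2"})))  # ['h1', 'r2']
```

In this sketch the single primary fault at r2 produces an apparent secondary fault at h1, which is the kind of masking effect that a propagation model is meant to capture.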
Approaches to the diagnosis problem have included expert systems, neural networks, fuzzy logic, the max-product algorithm, Petri nets, and others. A number of groups have used variants of the fault propagation approach with a variety of model types, including dependency graphs, causality graphs, coding approaches, Bayesian reasoning, and even phrase-structured grammars. Boolean variables have also been used, but primarily for reasoning about reachability from the observer, i.e., the NMS. These techniques, however, are quite complex. A simpler fault propagation approach is needed, such as one that enables enumeration of all possible alternative diagnostic explanations, that is, all combinations of possible faults that would give rise to the communication network monitoring results being considered.
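To make the idea of enumerating alternative diagnostic explanations concrete, the following Python sketch performs a brute-force search over candidate fault sets and keeps those whose predicted symptoms match the observations exactly. It is not a method described in this document; the PATHS table, the element names, and the functions predicted_down and explanations are hypothetical, and the sketch assumes a tree-shaped network in which an element appears down whenever any element on the probe path from the NMS has failed.

```python
from itertools import combinations

# Hypothetical model: for each monitored element, the elements that a probe
# from the NMS must traverse to reach it (including the element itself).
PATHS = {
    "r1": ["r1"],
    "r2": ["r1", "r2"],
    "h1": ["r1", "r2", "h1"],
    "r3": ["r1", "r3"],
    "h2": ["r1", "r3", "h2"],
}


def predicted_down(failed):
    """Elements that would appear down if exactly `failed` had failed."""
    return {
        element
        for element, path in PATHS.items()
        if any(hop in failed for hop in path)
    }


def explanations(observed_down, max_faults=None):
    """Enumerate every fault set that reproduces the observed symptoms.

    Candidates are tried in order of increasing size, so the smallest
    explanations come first. The search is exponential in the number of
    elements and is intended only for small illustrative models.
    """
    observed = set(observed_down)
    elements = sorted(PATHS)
    limit = len(elements) if max_faults is None else max_faults
    found = []
    for size in range(limit + 1):
        for candidate in combinations(elements, size):
            if predicted_down(set(candidate)) == observed:
                found.append(set(candidate))
    return found


# r2 and h1 are observed down: either r2 alone failed, or r2 and h1 both did.
for fault_set in explanations({"r2", "h1"}):
    print(sorted(fault_set))
```

For the observation that r2 and h1 are down, the sketch yields two explanations: r2 alone, or r2 together with h1. The second case reflects that a fault at h1 cannot be ruled out while r2 blocks the view of it from the NMS.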