Fault isolation in network management systems is a difficult and often inefficient task. Fault isolation attempts to identify networked system entities that are not operational, or “down,” and are a root cause of a potential larger networked system outage. As used herein, an entity is a device, process, or other resource in a networked system that is under management of, or otherwise tracked or modeled by, a network management system. Some entities that are tracked or modeled may be entities that the network management system is unable to directly obtain operational status information from, such as network cable links as opposed to network interconnection devices such as routers. Other examples of an entity that the network management system is unable to directly obtain operational status information from may include servers, processes, and hardware maintained by external organizations and “dumb” devices that have limited or no Simple Network Management Protocol (SNMP) communication capabilities. Such an entity is referred to herein as a “logical entity.” An entity that a network management system is able to directly obtain operational status information from is referred to as a “physical entity.”
Network management systems typically maintain a model of the logical entity that includes a last known operational status of the logical entity. The operational status of the logical entity is inferred through the operational status of physical entities that neighbor the logical entity within a larger networked system topology. Each physical entity is also represented by a model in the network management system that maintains an operational status of the respective physical entity. Through the status of the neighboring physical entities as represented in the physical entity models, a status of the logical entity may be inferred. For example, if all of the neighboring physical entities of the logical entity have a status of “up,” the logical entity may be inferred to have a status of “up.” Conversely, if all or a majority of the neighboring physical entities have a status of “down,” the logical entity may be inferred to have a status of “down.” However, in an instance where all of the neighboring physical entities have a status of “up,” but a fault is detected with regard to the logical entity, the status of the logical entity may be “down” and an inference may be drawn that the logical entity is the root cause of the fault.
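The inference rules described above may be illustrated with a minimal Python sketch. The names `Status` and `infer_logical_status` are hypothetical and chosen only for illustration. The handling of mixed neighbor statuses that fall short of a majority “down,” and of a logical entity with no neighbors, is not specified above; returning an “unknown” status in those cases is an assumption.

```python
from enum import Enum


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # assumption: used when the stated rules do not decide


def infer_logical_status(neighbor_statuses, fault_detected=False):
    """Infer a logical entity's status from its neighboring physical entities.

    Returns a tuple (inferred_status, is_root_cause).
    """
    total = len(neighbor_statuses)
    if total == 0:
        # Assumption: with no neighbors there is nothing to infer from.
        return Status.UNKNOWN, False

    up = sum(1 for s in neighbor_statuses if s is Status.UP)
    down = sum(1 for s in neighbor_statuses if s is Status.DOWN)

    if up == total:
        # All neighbors are up. If a fault was nevertheless detected for the
        # logical entity, it is inferred to be down and the root cause.
        if fault_detected:
            return Status.DOWN, True
        return Status.UP, False

    if down * 2 > total:
        # All or a strict majority of the neighbors are down.
        return Status.DOWN, False

    # Assumption: any other mix of statuses is left undecided.
    return Status.UNKNOWN, False
```

For instance, `infer_logical_status([Status.UP, Status.UP], fault_detected=True)` yields a status of “down” with the root-cause flag set, matching the last case described above.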
The difficulties and inefficiencies in fault isolation by network management systems arise in instances such as when the network management system detects that it has lost contact with a logical entity. For example, upon detection of a fault with regard to a logical entity, the network management system will trigger a fault isolation process to identify the status of physical entities and infer the status of logical entities. In such a process, the network management system will send messages from the logical entity model for which the fault was detected to the models of its neighboring entities. Each neighboring entity model will receive its respective message and check its own status, such as by querying the physical device represented by a physical entity model. If the status is “up,” the model then sends a message to each of its own neighbors inquiring about their status. In such instances, the neighbors of the physical entity model, including the logical entity model for which the fault was detected, often receive a second message inquiring about their status. As a result, the status inquiry messages that originate with the logical entity model may end up being repeated many times. This creates excessive inter-model processing within the network management system, which consumes processing resources. Further, when physical entity models perform a status query of their respective physical entities, considerable traffic may flood the organizational network. As a networked system is scaled up, such fault isolation techniques become more and more resource intensive, increasing latency within networked systems and network management systems.
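The message repetition described above may be illustrated with a small Python simulation. This is a simplified sketch rather than the behavior of any particular network management system. The function name `flood_status_inquiries`, the dictionary-based topology, and the `max_hops` limit are all hypothetical; the hop limit is an assumption added only so that the simulation terminates.

```python
from collections import Counter, deque


def flood_status_inquiries(topology, is_up, origin, max_hops):
    """Count the status inquiries each entity model receives.

    topology: dict mapping each entity name to a list of neighbor names.
    is_up:    dict mapping each entity name to True if its status is "up".
    origin:   the logical entity model for which the fault was detected.
    max_hops: assumed limit on how far an inquiry is propagated.
    """
    received = Counter()
    # The origin model first sends an inquiry to each of its neighbors.
    queue = deque((neighbor, 1) for neighbor in topology[origin])

    while queue:
        node, hops = queue.popleft()
        # Every receipt triggers a status check, even a repeated one.
        received[node] += 1
        if not is_up.get(node, False) or hops >= max_hops:
            continue
        # An "up" model inquires of all of its neighbors, including the
        # model that the inquiry came from.
        for neighbor in topology[node]:
            queue.append((neighbor, hops + 1))

    return received


# Hypothetical topology: a cable link (logical entity) between two routers,
# each of which also connects to the other and to a third router.
topology = {
    "link": ["r1", "r2"],
    "r1": ["link", "r2", "r3"],
    "r2": ["link", "r1", "r3"],
    "r3": ["r1", "r2"],
}
is_up = {"link": False, "r1": True, "r2": True, "r3": True}
counts = flood_status_inquiries(topology, is_up, "link", max_hops=3)
```

Under these assumptions, the four-entity topology produces 18 inquiries within three hops, and every model, including the originating logical entity model, receives at least four of them.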