The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Some network management systems implement a data model of the managed network, in which programmatic objects represent network elements such as routers and switches, as well as links between the network elements. Other network management systems implement a network management function known as root cause analysis. Typically, a network problem causes observable changes in attributes and states of entities in the network. As a result, a plurality of events may be emitted by one or more source entities in the network that happen to observe the attribute changes and the state changes caused by the problem.
Under some approaches, root cause analysis may be performed using causality graphs constructed from the collected events. If such approaches converge to a solution within a finite amount of time, the constructed graphs may indicate root causes for problems in the physical network. However, existing techniques for root cause analysis, such as those constructing causality graphs using events as input, may take an inordinately long time to converge, or may fail to converge at all, especially when the number of events is large. In addition, the techniques may not robustly deal with a situation where key events are missing. Since events are typically collected using unreliable transport protocols such as syslog or a trap mechanism of Simple Network Management Protocol (SNMP), some key events may not reach the network management system.
Some existing techniques configure a time window and, for efficiency purposes, disqualify (or remove) all events outside the window from the root cause analysis. However, because network problems and their symptoms propagate at different rates and appear at different times in different locations of the physical network, it is often difficult to configure such a time window so that it excludes irrelevant events while, at the same time, including relevant events.
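By way of illustration only, the difficulty of sizing such a time window may be sketched as follows. The sketch is not drawn from any particular product; the `Event` record, the `filter_by_window` function, the element names, the event kinds, and the timestamps are all hypothetical and are assumed solely for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical name of the emitting network element
    kind: str         # hypothetical event type
    timestamp: float  # seconds on an assumed common clock


def filter_by_window(events, symptom_time, window_seconds):
    """Keep only events within window_seconds before symptom_time."""
    start = symptom_time - window_seconds
    return [e for e in events if start <= e.timestamp <= symptom_time]


# Assumed scenario: a link failure at t=100 leads to a BGP neighbor loss
# at t=190, and an unrelated fan alarm happens to occur at t=120.
events = [
    Event("router-a", "LINK_DOWN", 100.0),
    Event("router-c", "FAN_ALARM", 120.0),
    Event("router-d", "BGP_NEIGHBOR_LOSS", 190.0),
]

# A narrow window discards the event that is the actual cause.
narrow = filter_by_window(events, symptom_time=190.0, window_seconds=30.0)

# A wide window retains the cause but also admits the unrelated alarm.
wide = filter_by_window(events, symptom_time=190.0, window_seconds=120.0)
```

In this assumed scenario, the 30-second window leaves only the symptom itself, while the 120-second window keeps the link failure together with the irrelevant fan alarm, so neither setting both includes the relevant events and excludes the irrelevant ones.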
The problem of finding the cause for an event in the network can be viewed as a search problem. The search can be bounded by time to form a time window in which the cause must exist. However, the search space may be unduly large.
In one prior approach exemplified by software products from IBM Micromuse, a root cause analysis (RCA) engine performs single hop RCA. Single hop RCA means that the product will not identify a root cause that resides more than one hop from the symptom. For example, if a link goes down and causes the loss of Border Gateway Protocol (BGP) neighbors in routers that are more than a single hop from the devices connected by that link, the Micromuse product is unable to detect that the link failure is the root cause of the loss of BGP neighbors. In another approach exemplified by EMC SMARTS, a model-based statistical mechanism and a definition language-codebook approach are used.
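The single hop limitation may be illustrated with a minimal sketch, which is not the implementation of any named product. It assumes a hypothetical four-router chain topology, and the `hop_distances` and `candidate_causes` functions, the `max_hops` parameter, and the event names are introduced here purely for illustration.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search returning the hop count from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def candidate_causes(adjacency, symptom_node, events_by_node, max_hops=1):
    """Return (node, event kind) pairs within max_hops of the symptom node."""
    dist = hop_distances(adjacency, symptom_node)
    return sorted(
        (node, kind)
        for node, kinds in events_by_node.items()
        if dist.get(node, float("inf")) <= max_hops
        for kind in kinds
    )


# Hypothetical modeled topology: r1 - r2 - r3 - r4.
adjacency = {
    "r1": ["r2"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "r4"],
    "r4": ["r3"],
}

# Assumed events: the r1-r2 link fails, and r4 later loses a BGP neighbor.
events_by_node = {"r1": ["LINK_DOWN"], "r2": ["LINK_DOWN"]}

single_hop = candidate_causes(adjacency, "r4", events_by_node, max_hops=1)
multi_hop = candidate_causes(adjacency, "r4", events_by_node, max_hops=3)
```

Under these assumptions, the search limited to one hop from `r4` returns no candidate causes, whereas a search extended to three hops reaches the link failure events on `r1` and `r2`.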