Network Layering and Topology
A fundamental principle of communication networks concerns interactions that take place along vertical and horizontal planes: layering and topology, respectively. The principle of layering has been clearly articulated in the ISO/OSI reference model, which implies vertical relationships in which the network entities of layer N interact with network entities of layer N−1. At the same time, topology implies horizontal relationships, in which two or more entities at layer N are logical or physical peers. A network can thus be abstracted as a graph, consisting of nodes (communications entities) and edges (representing topological and layering relationships). In many scenarios, communications entities of several communications layers co-reside in the same device. However, these entities are still clearly distinguished for management purposes, such as with physical ports and their logical interfaces.
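By way of illustration only, the graph abstraction described above might be sketched as follows. The names `Entity`, `Relation`, and `NetworkGraph`, the example devices, and the rule that any edge between different layers counts as a layering edge are assumptions made for this sketch and are not taken from any particular management system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    LAYERING = "layering"  # vertical: entities at different layers
    TOPOLOGY = "topology"  # horizontal: peers at the same layer


@dataclass(frozen=True)
class Entity:
    """A communications entity (hypothetical model for this sketch)."""
    name: str
    layer: int
    device: str


@dataclass
class NetworkGraph:
    # Maps each entity to a list of (neighbor, relation) pairs.
    edges: dict = field(default_factory=dict)

    def connect(self, a: Entity, b: Entity) -> Relation:
        """Add an undirected edge; the relation kind follows from the layers."""
        relation = Relation.TOPOLOGY if a.layer == b.layer else Relation.LAYERING
        self.edges.setdefault(a, []).append((b, relation))
        self.edges.setdefault(b, []).append((a, relation))
        return relation

    def neighbors(self, entity: Entity, relation: Relation) -> list:
        """Return the neighbors of an entity along one kind of relationship."""
        return [other for other, kind in self.edges.get(entity, []) if kind is relation]


# A physical port and its logical interface co-reside in router-1,
# while the port has a physical peer in router-2.
port_a = Entity("port-a", layer=1, device="router-1")
if_a = Entity("if-a", layer=2, device="router-1")
port_b = Entity("port-b", layer=1, device="router-2")

graph = NetworkGraph()
graph.connect(if_a, port_a)    # vertical (layering) edge
graph.connect(port_a, port_b)  # horizontal (topology) edge
```

In this sketch the port and its interface remain separate nodes even though they share a device, which mirrors the distinction drawn above for management purposes.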
A network management system can represent a network topology using a topology model that consists of both logical and physical objects. The logical objects are (a) logical or virtual protocol objects, such as TCP ports or connection end points, and (b) association objects that model relationships between objects, such as a connection or a route peer. A fault occurring in a network, such as an interface failure, may affect any other object in the network and thus cause further events to occur, or issue, at those other objects. These further events issue because the other objects have recognized a symptom of the fault that gave rise to the original event.
Because of the multiple relationships and interdependencies between entities in a network, many abnormal events occurring during network operations have ripple effects across the network. An event happening at layer N may trigger a chain reaction in the vertical direction, from N to N+1, and in the horizontal direction, along the peers at layer N. N:N+1 and N:N chain reactions may propagate recursively. The same underlying root cause can thus cause a multitude of events to be issued from interacting entities across the network. Hence, an event issuing at a given entity can be related to events issuing in entities in both horizontal and vertical directions across the network. Furthermore, as a fault propagates through the network, the events or alarms issued at affected nodes do not have the same data elements, data format, or content.
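The recursive N:N+1 and N:N propagation described above can be illustrated with a short breadth-first sketch. The adjacency table `DEPENDENTS`, the entity names, and the function `affected_entities` are hypothetical and serve only to show how one root cause may reach many entities.

```python
from collections import deque

# Hypothetical dependency table: entity -> list of (neighbor, direction).
# "up" is a vertical N:N+1 relationship; "peer" is a horizontal N:N one.
DEPENDENTS = {
    "port-1": [("if-1", "up"), ("port-2", "peer")],
    "port-2": [("if-2", "up")],
    "if-1": [("bgp-session-1", "up"), ("if-2", "peer")],
    "if-2": [("bgp-session-2", "up")],
    "bgp-session-1": [("bgp-session-2", "peer")],
    "bgp-session-2": [],
}


def affected_entities(root, dependents):
    """Return every entity reachable from the failed entity, in BFS order."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbor, _direction in dependents.get(current, []):
            if neighbor not in seen:  # visit each entity once, even on cycles
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# A single port failure can surface as events at five other entities.
ripple = affected_entities("port-1", DEPENDENTS)
```

Under these assumptions, a failure of `port-1` reaches both interfaces and both sessions, although only one root cause exists.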
In general, a system event can be described as a state transition of a component of the system. In the context of a communications network, an event can be described as a state transition of a communication entity in the network, such as a router port or a logical interface. More specifically, a fault event in a communications network can be described as a state transition of a network entity from a normal state to a faulty state.
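As a minimal sketch of this definition, an event may be modeled as a record of an entity together with its old and new states. The two-state model and the names `State`, `Event`, and `is_fault` are assumptions chosen for illustration; an actual system may distinguish many more states.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    FAULTY = "faulty"


@dataclass(frozen=True)
class Event:
    """A state transition of one network entity (hypothetical model)."""
    entity: str
    old_state: State
    new_state: State

    @property
    def is_fault(self) -> bool:
        # A fault event is specifically a normal-to-faulty transition.
        return self.old_state is State.NORMAL and self.new_state is State.FAULTY


link_down = Event("router-1/port-3", State.NORMAL, State.FAULTY)
link_up = Event("router-1/port-3", State.FAULTY, State.NORMAL)
```

Here `link_down` is a fault event, while `link_up` is an event but not a fault event.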
Event Correlation
Event correlation is an important function in fault management systems, serving to identify events that are likely triggered by the same root cause. Event correlation is used to analyze and pare down significant numbers of events that might otherwise inundate users and applications, so that appropriate action can be taken more quickly and effectively in response to the root cause.
One approach to determining what other events are related to an event issued at a failed entity involves traversing a topology graph, which represents the network topology using a graph of interconnected nodes with the interconnections representing logical or physical dependencies between the entities represented by the nodes. However, such a brute-force traversal is neither optimal nor practical.
Hence, one common problem with event correlation involves handling the combinatorial explosion of event combinations that might be correlated and, therefore, having correlation algorithms that work properly with thousands or millions of events. The challenges regarding scale are a significant issue in any system that is based on an inference engine or rule processing.
For example, with rule-based systems without a topology model, event correlation rules are encoded using a rule language, such as Prolog, CLIPS, or others. The possible cause-effect relationships between various events (such as E1 causes E2) are enumerated in advance as much as possible and encoded in the rules. At run time, the rule engine correlates input events by traversing (or “inferencing” on) the implicit cause-effect graph. Transitive relationships need not be encoded since the rule engine is able to deduce those relationships during the traversal or inference process.
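The rule-based approach can be illustrated, in simplified form, with plain Python rather than an actual rule language such as Prolog or CLIPS. The rule table `RULES` and the functions `effects_of` and `root_causes` are hypothetical; the sketch only shows that direct rules such as "E1 causes E2" and "E2 causes E3" suffice for the engine to deduce that E1 also explains E3.

```python
# Hypothetical cause-effect rules enumerated in advance: cause -> direct effects.
# Only direct relationships are encoded; transitive ones are deduced.
RULES = {
    "E1": ["E2"],
    "E2": ["E3"],
    "E4": ["E5"],
}


def effects_of(cause, rules):
    """Return all events reachable from a cause in the implicit rule graph."""
    found = set()
    stack = [cause]
    while stack:
        current = stack.pop()
        for effect in rules.get(current, []):
            if effect not in found:
                found.add(effect)
                stack.append(effect)
    return found


def root_causes(observed, rules):
    """Return observed events that no other observed event explains."""
    explained = set()
    for event in observed:
        explained |= effects_of(event, rules)
    return sorted(set(observed) - explained)


# E2 and E3 are explained by E1; E5 is unexplained because E4 was not observed.
roots = root_causes(["E1", "E2", "E3", "E5"], RULES)
```

Each call to `effects_of` walks the rule graph again, which hints at why such inference becomes costly as the number of events and rules grows.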
With topology-based systems, an explicit topology model (i.e., a graph) of the network is used to correlate events happening in the objects (i.e., nodes) in the model. There are two major approaches to event correlation in topology-based systems: (a) approaches based on event propagation models, and (b) approaches based on heuristics. With the event propagation model approach, the topology model is augmented with event propagation “rules” or statements. The rules state how events are propagated along associated objects of the topology, leading to inference chains that fire along the presence of events and relationships of affected objects. With the heuristics-based approach, appropriate domain knowledge is employed to correlate events occurring in the objects of the topology model.
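A minimal sketch of the event propagation model approach follows. The topology triples, the relationship names (`hosts`, `terminates`), the event type names, and the function `predicted_events` are all assumptions made for illustration and do not reflect the rule syntax of any commercial product.

```python
# Hypothetical topology model: (object, relationship, related object).
TOPOLOGY = [
    ("port-1", "hosts", "if-1"),
    ("if-1", "terminates", "tunnel-1"),
    ("port-2", "hosts", "if-2"),
]

# Hypothetical propagation statements: an event of a given type, seen across
# a given relationship, implies an event of another type at the related object.
PROPAGATION = {
    ("PortDown", "hosts"): "InterfaceDown",
    ("InterfaceDown", "terminates"): "TunnelDown",
}


def predicted_events(obj, event_type):
    """Follow the propagation statements along the topology from one event."""
    predicted = []
    seen = {(obj, event_type)}
    frontier = [(obj, event_type)]
    while frontier:
        source, kind = frontier.pop()
        for a, relationship, b in TOPOLOGY:
            implied = PROPAGATION.get((kind, relationship))
            if a == source and implied is not None and (b, implied) not in seen:
                seen.add((b, implied))
                predicted.append((b, implied))
                frontier.append((b, implied))
    return predicted


# Observed events that match a prediction can be treated as symptoms.
observed = {("if-1", "InterfaceDown"), ("tunnel-1", "TunnelDown"), ("if-2", "InterfaceDown")}
symptoms = [e for e in predicted_events("port-1", "PortDown") if e in observed]
```

In this sketch the event at `if-2` is not correlated with the failure of `port-1`, because no relationship in the topology model connects the two objects.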
Various commercial fault management systems support event correlation, with most of these systems supporting causal event correlation. With some approaches, a user is required to write, and maintain, complex causal propagation “rules” using the language provided with the associated software development kit. In other deductive rule-based fault management systems, a user is required to write complex (AI-type) rules. Even if the fundamental rules are packaged with the system, a user is still required to write the deductive “rules” that apply to the relevant network. With each approach, correlation occurs through inferring, instantiating rules, and traversing search spaces that grow polynomially, if not exponentially, with the number of events and the size and complexity of the network.
Based on the foregoing, there is a clear need for a more efficient and scalable event correlation technique that exploits knowledge of relationships between entities in a network to restrict the correlation search space.
The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.