Event and alarm correlation is a well-known technique in network management. An event correlation algorithm may determine a series of clusters of events that are likely to be related to each other by combining methods that take into account several properties of the events, such as the time when the events originated, the time when the events were received by a network management node (or management station or similar), the location where the event or alarm was generated, and topology information about the network. From a network fault management perspective, event correlation is an essential step towards determining a root cause defect that is responsible for the events within such a cluster.
An important feature in event correlation and root cause analysis is the correct size of an event correlation time window. An event correlation time window is a specified time period during which event information received from various places in a network is collected and stored in a memory of a network management node or similar. After an event correlation time window expires, the events received during this time window are analyzed and used in determining a root cause for these events. Commonly, the event correlation time window is set to a fixed size and is applied in a continuous, overlapping manner over the stream of events in order to select the events of potential interest. If the time window is too large, it may impose unnecessary requirements in terms of memory or processing power on the network management node performing the analysis. If the time window is too small, it may instead exclude events that would be of use during the root cause analysis.
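The trade-off described above can be illustrated with a minimal sketch of a fixed-size, overlapping time window. The event tuples, the function name `fixed_windows`, and the `size` and `step` parameters are assumptions made for illustration only and are not taken from any particular network management system.

```python
from typing import List, Tuple

# Hypothetical event shape: (timestamp in seconds, event identifier)
Event = Tuple[float, str]


def fixed_windows(events: List[Event], size: float, step: float) -> List[List[str]]:
    """Group events into fixed-size windows that advance by `step` seconds.

    Each window covers the half-open interval [start, start + size).
    With step < size, consecutive windows overlap.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not events:
        return []
    ordered = sorted(events)  # order by timestamp
    first, last = ordered[0][0], ordered[-1][0]
    windows = []
    start = first
    while start <= last:
        windows.append([name for t, name in ordered if start <= t < start + size])
        start += step
    return windows


events = [(0.0, "link_down_A"), (1.5, "los_B"), (4.0, "ais_C"), (9.0, "link_down_D")]

# A small window never sees "link_down_A" and "ais_C" together.
small = fixed_windows(events, size=2.0, step=1.0)

# A large window sees all events at once, at the cost of holding more of them in memory.
large = fixed_windows(events, size=10.0, step=5.0)
```

In this sketch the small window separates events that may share a root cause, while the large window retains every event for the whole period, which mirrors the two failure modes discussed above.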
A small degree of adaptability of the event correlation time window is introduced by Maitreya Natu and Adarshpal S. Sethi in “Using temporal correlation for fault localization in dynamically changing networks,” Int. J. Netw. Manag. 18, 4 (August 2008), 301-314. Natu and Sethi suggest setting the size of the window to the time between two consecutive topology updates when topology updates are frequent. When topology updates are infrequent, the size can instead be set to some minimum time for a change to be reported to a manager.
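The rule attributed to Natu and Sethi above may be sketched as follows. This is an illustration only and not the authors' implementation: the function name, the parameter names, and in particular the `frequent_threshold` used to decide whether updates count as frequent are assumptions, since the text does not state how that distinction is made.

```python
from typing import Sequence


def window_size(update_times: Sequence[float],
                min_report_time: float,
                frequent_threshold: float) -> float:
    """Return an event correlation window size in seconds.

    update_times: timestamps of observed topology updates, in ascending order.
    min_report_time: minimum time for a change to be reported to a manager.
    frequent_threshold: assumed cut-off; a gap at or below it counts as "frequent".
    """
    if len(update_times) < 2:
        # No interval between updates is known yet; fall back to the minimum.
        return min_report_time
    gap = update_times[-1] - update_times[-2]
    if gap <= frequent_threshold:
        # Frequent updates: use the time between two consecutive updates.
        return gap
    # Infrequent updates: use the minimum reporting time.
    return min_report_time


frequent = window_size([0.0, 10.0, 25.0], min_report_time=5.0, frequent_threshold=30.0)
infrequent = window_size([0.0, 600.0], min_report_time=5.0, frequent_threshold=30.0)
```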
In “Dynamic Adaptation of Temporal Event Correlation for QoS Management in Distributed Systems” (short paper in 14th IEEE International Workshop on Quality of Service, June 2006), Rean Griffith, Joseph L. Hellerstein, Gail Kaiser, and Yixin Diao propose an approach that takes propagation delays into account. The proposal includes a system to measure actual delays, a component that estimates propagation delays in a statistical manner, and a controller that updates temporal rules associated with events based on this information. The proposed method can account only for fairly simple changes in the temporal patterns of the propagation. Further, the disclosed algorithm works well when propagation skews are independent and identically distributed. However, in metro or wide-area transport networks, it is likely that a problem resulting in re-routing would cause propagation delays that are strongly dependent on the topological location of the problem.
Wu, Mao, Rexford and Wang, in “Finding a needle in a haystack: pinpointing significant BGP routing changes in an IP network” (In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2 (NSDI'05), USENIX Association, Berkeley, Calif., USA, 1-14), propose a mechanism for determining a correlation window based on combining a fixed time interval with a maximum number of events that have to occur during this interval. The time interval is set, as a constant, according to particular characteristics of the routing system. The maximum number of events is also set according to a heuristic method. The proposal thus relies on a heuristic estimation of the control parameters. As such, it is difficult to adapt the method to a particular network configuration without expert knowledge of how the method works and how the overall network properties need to be reflected in the heuristic.
Other approaches to determining the size of the event correlation time window include adapting the size depending on the events and sequences of events received by a management node. For example, U.S. Pat. No. 7,661,032 B2 describes a window-resizing module as part of an event correlation system. That proposal is based on an algorithm that, given a current event, recognizes this event as part of a larger symptom, anticipates a future event that might occur as part of the same symptom at a future time, and automatically extends the size of the correlation window to take this future event into account. This approach requires extensive a priori knowledge of the events and sequences of events that are part of a symptom.
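The general idea of extending a window in anticipation of a future event can be sketched as below. This is not the algorithm of the cited patent; the `SYMPTOMS` table, the event names, the delays, and the function `extend_window` are all hypothetical and serve only to show why such an approach depends on a priori symptom knowledge.

```python
from typing import Dict, Tuple

# Hypothetical a priori knowledge:
# event type -> (anticipated follow-up event, maximum expected delay in seconds)
SYMPTOMS: Dict[str, Tuple[str, float]] = {
    "link_down": ("route_withdrawn", 30.0),
    "fan_failure": ("over_temperature", 120.0),
}


def extend_window(window_end: float, event_type: str, event_time: float,
                  symptoms: Dict[str, Tuple[str, float]] = SYMPTOMS) -> float:
    """Return the (possibly extended) end time of the correlation window.

    If the current event is a known part of a symptom, the window is extended
    so that it still covers the anticipated follow-up event.
    """
    follow_up = symptoms.get(event_type)
    if follow_up is None:
        # Unknown event: nothing can be anticipated, so the window is unchanged.
        return window_end
    _, max_delay = follow_up
    return max(window_end, event_time + max_delay)


extended = extend_window(window_end=60.0, event_type="link_down", event_time=50.0)
unchanged = extend_window(window_end=60.0, event_type="cpu_high", event_time=50.0)
```

Any event type missing from the table leaves the window unchanged, which illustrates the dependence on prior knowledge noted above.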
All the above-mentioned methods for setting the size of an event correlation time window are thus associated with one or more disadvantages.