The Fault Management discipline in network management systems comprises sets of functions enabling the detection, isolation and correction of abnormal operations in the communication network and its environment. Abnormal operations may relate to events such as physical resource failures (e.g. link outage), communication failures or security violations occurring in the interconnected nodes forming the network.
Functions associated with Fault Management provide, among others, alarm reporting, which requires on the one hand that the nodes detect failures and report alarms, and on the other hand that the information related to those failures is presented to network operators. Network operators are responsible for ensuring that the network provides the services that users expect. This responsibility depends on real-time notification of abnormal network operations so that appropriate recovery actions can be taken. To fulfill this duty, network operators rely on Fault Management, first to be informed of the failure occurrence, and second to obtain correlated fault information on that failure. Fault correlation requires that the resources functionally affected by the failure are registered together with the failure and that this correlated information is accessible to the network operators.
In current networks, both different characteristics and different solutions can be found. Essentially, two characteristics of prior networks have shaped the approach to error correlation:

the bandwidth available on a given network interface has de facto limited the number of logical resources served by the physical media, and
the logical resources were tightly linked to physical resources, which made the network topology very static.

Error correlation functions in this environment can be based on:

information on all the resources affected by a failure in the network, with asynchronous notifications raised from the network to the network management system,
a posteriori (i.e. after failure occurrence) retrieval of information on a given resource (e.g. verification of the status of a resource),
a posteriori verification of the valid connections.

Networking is evolving to higher speeds, thus offering an appropriate infrastructure for emerging multimedia applications. High speed networking provides physical media transport over 2,000 kilobits per second. When the network provides such a high bandwidth, the number of granular or elementary accesses is also very high. Such speeds therefore lead to an increasing number of logical resources (e.g. protocol interfaces, connections) that the physical media can serve. The additional complexity introduced by the large number of supported resources in these new generation networks requires developing the classical network architecture into a distributed network structure. The classical network architecture associates physical media support with the physical protocol layer, failure detection and some corrections with the link protocol layer and the higher layers above it, and network management with applications. Emerging networks, by contrast, tend to distribute some network management functions into the protocol layers, such as physical media backup decision and operation by the physical protocol layer, and connectivity backup decision and operation by the link protocol layers.

Two consequences for Fault Management derive from the current network evolution:

one failure will have disruptive effects on a larger number of applications and users, and
one failure will trigger many alarms in the network, related to the affected logical resources.

Applying the current solutions to high speed and dynamic networks would lead to a network flooded by network management traffic (mainly due to asynchronous events), and to the retrieval of wrong information, as the network may already have decided to redistribute logical resources to new, healthy physical resources. When a physical resource failure occurs, the associated alarm would be triggered. Each affected logical resource would then trigger its own alarm, and the network operator would be flooded by hundreds of alarms due to one failure, without any analysis tool to use.

This demonstrates that the usual correlation algorithms are no longer appropriate to current high speed networks, the main inhibitors being the number of logical resources and the topology dynamics.

The following new requirements derive from this consequence:

the need to correlate a physical resource failure with the logical resources which were previously served by this failed entity, and
the need to restrict the overall fault management flow to avoid excessive network bandwidth utilization for network management purposes.

informing the operator of the failure of the physical resources, and
keeping the information inside the node for on-request retrieval.

the network operator configures each node in the network to enable logging of required information in a memory of the Network Element,
the physical resource triggers an alarm when it is affected by a failure,
each affected logical resource logs failure information on reception of the physical alarm, and
the operator requests log retrieval for analysis.

on the physical resource failure occurrence, the correlation key is built comprising information on the affected link as it is seen from the respective neighbouring node;
the physical alarm is triggered; it contains the correlation key with location information, which forms part of the alarm data;
the failure information and the correlation key are transmitted to the access nodes of the network;
in the access nodes, the affected logical resources are associated to the physical failure;
the alarm related to the logical resource is built with the correlation key.
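The correlation-key steps listed in this section can be illustrated with a minimal Python sketch. It is not the implementation described in the text: the names `CorrelationKey`, `Alarm`, `AccessNode`, `raise_physical_alarm` and `correlate` are hypothetical, the key layout (neighbouring node identifier plus link identifier) is an assumption, and the transmission of the key to the access nodes is reduced to a direct method call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrelationKey:
    """Assumed key layout: the failed link as seen from the neighbouring node."""
    node_id: str
    link_id: str


@dataclass(frozen=True)
class Alarm:
    kind: str            # "physical" or "logical"
    resource: str        # name of the resource the alarm refers to
    key: CorrelationKey  # carried as part of the alarm data


def raise_physical_alarm(node_id, link_id):
    """Build the correlation key on failure and attach it to the physical alarm."""
    key = CorrelationKey(node_id, link_id)
    return Alarm("physical", f"{node_id}/{link_id}", key)


class AccessNode:
    """Hypothetical access node that knows which links each logical resource uses."""

    def __init__(self, name, connections):
        self.name = name
        # logical resource name -> set of (node_id, link_id) pairs it traverses
        self.connections = connections

    def on_failure(self, key):
        """Associate affected logical resources with the failure, reusing the key."""
        failed_link = (key.node_id, key.link_id)
        return [
            Alarm("logical", resource, key)
            for resource, links in sorted(self.connections.items())
            if failed_link in links
        ]


def correlate(alarms):
    """Group alarms by correlation key, one group per physical failure."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.key].append(alarm)
    return dict(groups)
```

Because every logical alarm carries the same key as the physical alarm, a management application in this sketch can present one group per failure instead of one unrelated entry per affected logical resource.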