1. Technical Field of the Invention
The present invention relates to fault management systems and, in particular, to a method and apparatus for correlating alarms generated by network elements within a given network comprising, for example, a telecommunications or data network.
2. Description of Related Art
In a network, such as a telecommunications or data network, a single fault within or concerning the network may generate multiple alarms from network elements over space and time. It is imperative that the network operator be able to evaluate these alarms to determine the cause of the fault. This procedure involves first correlating the alarms together by recognizing that the plural alarms are caused by the same network fault. Once the fault is isolated in this manner, the corresponding cause may be addressed and corrected. In a large network, where simultaneously occurring faults may exist and a storm of network element alarms may be generated, the correlation operation is much more complex, and it becomes more difficult for the network operator to partition the plural alarms into associations relating to individual faults. What is needed is an apparatus and method for assisting the network operator with this correlation process when dealing with multiple alarms that arise from unrelated network faults.
Network elements are organized in a number of topologies. A hierarchical arrangement, for example, is prevalent in real networks. Examples that can be captured by such arrangements include the digital hierarchy of a transmission network, network and sub-network relations, and network resource naming conventions. It would be an advantage if the apparatus and method for correlating alarms could take advantage of such hierarchical and topological information concerning the managed network to assist in and speed the correlation process.
Alarms occurring in network elements placed at lower levels of the hierarchy tend to propagate to higher level network elements. In some instances, network operators recognize that certain types of alarms resulting from a given fault tend to propagate from element to element through the network in a certain manner (perhaps having some relation to hierarchy or topology). It would be an advantage if the apparatus and method for correlating alarms could take advantage of such propagation characteristics to assist in and speed the correlation process.
More generally, there is a need for an apparatus and method for correlating alarms in a managed network that is capable of near real-time correlation of a large number of simultaneous alarms with reduced time and computational resources.
An historical context is maintained containing sets previously built for previously received alarms. Each set therein contains not only a network element in an alarmed state but also network elements related to that network element by alarm propagation considerations and prior correlations. When a current alarm from a certain network element is received, a new set is built for that current alarm containing not only that certain network element but also other network elements related thereto by alarm propagation considerations. The new set is then merged with one of the previously built sets in the historical context if there exists a likelihood that the current alarm and the previously received alarm are caused by the same network fault. In one preferred embodiment, a likelihood is deemed to exist when a network element is shared in common between the new set for the current alarm and a previously built set relating to a previously received alarm. In a more generic implementation, any suitable merger test could be defined, perhaps by a network operator, and used to measure correlation.
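By way of illustration only, the set-building and merging operations described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `build_set`, `shares_element`, `correlate`, and the propagation map are hypothetical; propagation is assumed to be transitive; and, where a new set could match several previously built sets, the sketch merges it with the first match only.

```python
from typing import Callable, Dict, List, Set

# Hypothetical propagation map: element -> elements its alarms tend to reach.
Propagation = Dict[str, List[str]]


def build_set(element: str, propagation: Propagation) -> Set[str]:
    """Build the set for an alarm: the alarmed element plus every element
    reachable from it through the propagation map (assumed transitive)."""
    seen = {element}
    stack = [element]
    while stack:
        current = stack.pop()
        for nxt in propagation.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def shares_element(new_set: Set[str], old_set: Set[str]) -> bool:
    """Merger test of the preferred embodiment: a common network element."""
    return not new_set.isdisjoint(old_set)


def correlate(
    history: List[Set[str]],
    element: str,
    propagation: Propagation,
    merge_test: Callable[[Set[str], Set[str]], bool] = shares_element,
) -> int:
    """Process one current alarm against the historical context.

    Returns the index in `history` of the set that now holds the alarm.
    Assumption: only the first matching previous set is merged.
    """
    new_set = build_set(element, propagation)
    for index, old_set in enumerate(history):
        if merge_test(new_set, old_set):
            history[index] = old_set | new_set  # merge into the existing set
            return index
    history.append(new_set)  # no likely common fault: start a new set
    return len(history) - 1


# Hypothetical hierarchy: ports raise alarms that propagate to cards and shelves.
propagation = {
    "port1": ["card1"],
    "port2": ["card1"],
    "card1": ["shelf1"],
    "port9": ["card9"],
}
history: List[Set[str]] = []
first = correlate(history, "port1", propagation)
second = correlate(history, "port2", propagation)
third = correlate(history, "port9", propagation)
```

In this sketch the alarms from `port1` and `port2` fall into the same set because both sets contain `card1`, whereas the alarm from `port9` shares no element with that set and therefore starts a separate one. An operator-defined merger test may be substituted by passing a different `merge_test` function.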