Data networks deployed by service providers or large enterprises often comprise hundreds or thousands of network devices. A network device may comprise one or more network elements, which are entities such as modules, ports, and slots. The network devices and their corresponding network elements may be managed by one or more network management systems, such as an operational support system (OSS), which are implemented using computer application programs that can communicate with the network devices. OSS applications are either obtained from commercially available sources or developed internally by telecommunications service providers.
When a network device detects a fault or error within itself or relating to one of its elements or relating to links to another device or elements, the network device generates an alarm message (“alarm” herein) and sends it to the network management system. To enable the network management system to detect fault conditions as they occur (“in real time”), some network elements are even designed and configured to generate and send such alarms repeatedly, until the fault or other causative condition is resolved or acknowledged. Such network devices may include routers, LAN switches, WAN switches, edge devices such as access routers, or other network elements, and system elements such as UNIX servers, etc.
Although this approach has the benefit of ensuring that alarms are known until they are resolved, it also creates certain management problems. In particular, isolating new alarms is difficult, because the processing required to uniquely identify an alarm is generally equal to the total alarm frequency multiplied by the number of network elements and multiplied by the length of the time period of observation. For example, one empirical study conducted by the inventor hereof identified, in a one-month observation period involving 9,000 network elements, over two million alarms representing only 129 unique alarm conditions.
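By way of illustration only, the figures from the foregoing study can be restated as averages. The following sketch uses only the numbers given above; because the study reports "over two million" alarms, the total is treated as a lower bound, and the variable names are chosen here for illustration.

```python
# Figures reported in the study described above.
total_alarms = 2_000_000      # "over two million", treated as a lower bound
unique_conditions = 129       # unique alarm conditions identified
network_elements = 9_000      # network elements observed for one month

# Average number of alarm messages received per unique alarm condition.
alarms_per_condition = total_alarms / unique_conditions

# Average number of alarm messages sent per network element in the month.
alarms_per_element = total_alarms / network_elements

print(f"at least {alarms_per_condition:,.0f} alarms per unique condition")
print(f"at least {alarms_per_element:,.0f} alarms per element per month")
```

On these figures, each unique alarm condition is represented, on average, by more than fifteen thousand messages that must be examined before the condition can be recognized as one already seen.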
Identifying the unique alarms requires extensive processing power and specialized knowledge of the syntax and semantics of the alarm messages. Further, these processing requirements, and the associated cost of analyzing the alarm messages, multiplied by the number of different software versions or revisions running on each network element or system element, add significantly to the total cost of maintaining an OSS for a network. The initial investment of a service provider in an OSS and the cost of upgrading or modifying an OSS are huge, and therefore it is less desirable to upgrade an OSS to read or parse new alarm types that are introduced from time to time.
Still another problem involves propagation of alarms among different network elements. A large network may have network devices from many different vendors. Each vendor may define a unique fault or alarm type and structure for its network devices when standard alarm types are deemed inadequate. When one device fails and generates an alarm, the device may communicate the alarm to a device from a different vendor, which generates a new alarm that is semantically identical to the original alarm but that has a different syntax. As a result, in existing networks, many different fault management processing modules have been deployed as accessory products or external systems. These approaches have been taken because the structure and internal details of the alarms or fault events are not well understood, which also makes it difficult for the owner or operator of the network to identify the fault.
One approach to addressing the foregoing problems is correlating alarms based on a correlation key label. However, current alarm correlation approaches that use correlation labels have significant limitations. The key size is generally large and uncompressed. In a worst-case scenario, an uncompressed correlation key could be as large as the original correlated message, effectively doubling the size of alarm traffic. Also, the way each vendor generates the labels might not be unique across the heterogeneous network.
A related problem is that the different network devices from different vendors may communicate semantically identical alarms using different protocols such as SNMP, Log, XML, etc. Moreover, within a given protocol, different network devices may report alarms using different protocol messages. For example, two devices that both use SNMP to report alarms may use different SNMP traps to report alarms that are semantically identical.
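The situation described above may be illustrated by a minimal sketch in which a receiving system must maintain a separate lookup per protocol and per message type before it can recognize that several alarms mean the same thing. All table entries, field names, and the vendor-specific trap identifier below are hypothetical and chosen only for illustration; only the standard linkDown trap identifier is taken from the SNMP standards.

```python
import re

# Hypothetical lookup tables; every entry is an illustrative assumption.
SNMP_TRAP_MEANINGS = {
    "1.3.6.1.6.3.1.1.5.3": "LINK_DOWN",     # standard linkDown trap
    "1.3.6.1.4.1.99999.0.1": "LINK_DOWN",   # made-up vendor-specific trap
}
LOG_PATTERNS = [
    (re.compile(r"changed state to down", re.IGNORECASE), "LINK_DOWN"),
]

def semantic_type(alarm: dict) -> str:
    """Map one received alarm to a protocol-independent meaning."""
    if alarm.get("protocol") == "snmp":
        return SNMP_TRAP_MEANINGS.get(alarm["trap_oid"], "UNKNOWN")
    if alarm.get("protocol") == "log":
        for pattern, meaning in LOG_PATTERNS:
            if pattern.search(alarm["text"]):
                return meaning
    return "UNKNOWN"

# Three alarms with different syntax that report the same kind of fault.
alarms = [
    {"protocol": "snmp", "trap_oid": "1.3.6.1.6.3.1.1.5.3"},
    {"protocol": "snmp", "trap_oid": "1.3.6.1.4.1.99999.0.1"},
    {"protocol": "log", "text": "Interface Gi0/1, changed state to down"},
]
meanings = [semantic_type(a) for a in alarms]
print(meanings)
```

Each new protocol, trap type, or log wording requires a further table entry at the receiving system, which is the maintenance burden the foregoing paragraphs describe.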
Still another problem is that an OSS may receive thousands of the same kind of alarm messages that are semantically identical but reference different network devices or links. To determine which messages are semantically identical and reference the same fault condition, the OSS must parse and interpret the messages using extensive processing resources. Using a consistent trap type does not solve the problem. For example, assume that a fault condition is “Link Down” (a very common kind of fault) and that all SNMP devices of all vendors use the same kind of SNMP trap to report Link Down. Due to propagation of alarms along interconnected links, each Link Down trap message may nevertheless include different values for Node Name, IP Address, Link ID, etc., even when only one device is at fault. Therefore, extensive parsing and correlation are required at the OSS to isolate the source of the fault.
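The Link Down example may be sketched as follows. The node names, addresses, and link identifiers are hypothetical values invented for illustration, and the dictionary layout is an assumption rather than any actual trap encoding.

```python
# Hypothetical Link Down traps received after a single device fails.
# All field values are invented for illustration.
traps = [
    {"trap": "linkDown", "node": "edge-1", "ip": "192.0.2.1", "link_id": "ge-0/0/1"},
    {"trap": "linkDown", "node": "core-1", "ip": "192.0.2.2", "link_id": "xe-1/0/0"},
    {"trap": "linkDown", "node": "core-2", "ip": "192.0.2.3", "link_id": "xe-2/0/4"},
]

# Every message carries the same trap type...
trap_types = {t["trap"] for t in traps}

# ...yet no two messages are identical, because the remaining fields differ.
distinct_messages = {tuple(sorted(t.items())) for t in traps}

print(len(trap_types), "trap type;", len(distinct_messages), "distinct messages")
```

Grouping on the trap type alone collapses the three messages into one group but does not indicate which device is at fault, while comparing whole messages treats them as three unrelated alarms. Neither comparison isolates the source without further parsing of the individual fields.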
Furthermore, for the SNMP protocol, there is no way to formally define a correlation key value or index value in a MIB, making the fault management task less organized, which is undesirable. For example, the same ‘INDEX’ constructs used to represent key scalar or tabular attributes in an SNMP MIB cannot be used to represent the key value of a trap in the MIB.
Based on the foregoing, there is a clear need in this field for an improved method of generating correlating alarm labels for the alarms generated by network management systems.
There is a specific need for a way to uniquely identify semantically identical alarms that are generated from different devices or devices' elements in a manner that is consistent across devices from different vendors.
There is also a need for a way to uniquely identify semantically identical alarms that are generated from different devices using different protocols or different message types within a given protocol.
There is also a need to provide a way to identify alarms without adversely impacting the speed of an OSS or similar system that is carrying out fault correlation.
There is also a need for an approach that provides a compressible correlation key to preserve network bandwidth and provide better performance than an uncompressed one.