The present invention generally relates to data processing in the field of networking. The invention relates more specifically to suppressing dependent alarms that are caused by other alarms using software event processing in a distributed network management system.
Network management systems are now in wide use for the purpose of facilitating administration, configuration, and monitoring of complex local area networks, wide area networks, campus networks, etc. An example of a commercially available network management system is Cisco WAN Manager, available from Cisco Systems, Inc., San Jose, Calif.
Some network management systems are implemented using object-oriented computer programming development environments. In these systems, it is convenient to represent physical elements of a real-world network, such as routers, switches, and their components, in terms of programmatic objects and instances of the objects. Cisco WAN Manager, for example, uses a set of managed objects and a set of events generated by the managed objects.
A managed object is a resource within a system that may be managed through a management protocol. For example, in a telecommunication network, switches (or nodes), cards, and ports can be managed using SNMP and may be represented by objects that are instantiated by the network management system. Managed objects may comprise physical managed objects or logical managed objects. Physical managed objects are resources that are defined by physical hardware components. Examples of physical managed objects that are useful in representing a telecommunication network include nodes, cards, ports, and trunks. Logical managed objects, in contrast, are supported by one or more hardware components. Examples of logical managed objects include end-to-end user connections and endpoints of user connections.
Physical managed objects may be related to each other using one or more object containment relationships. Managed object instances that contain other managed objects are called composite managed objects. Typically, managed objects are created and stored in a structure that has a directed acyclic graph topology. For example, a node object contains all cards of that node, a card contains all ports on the card, etc.
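The containment relationship described above can be illustrated with a minimal Python sketch. The class name `ManagedObject`, its fields, and the object names are hypothetical and are used only for illustration; they are not taken from any particular network management system.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ManagedObject:
    """Hypothetical composite managed object (illustrative names only)."""
    name: str
    kind: str  # e.g. "node", "card", "port"
    children: List["ManagedObject"] = field(default_factory=list)

    def add(self, child: "ManagedObject") -> "ManagedObject":
        """Record that this object contains `child` and return the child."""
        self.children.append(child)
        return child

    def descendants(self) -> Iterator["ManagedObject"]:
        """Yield every contained object, depth first.

        In a general directed acyclic graph an object reachable through
        more than one container would be yielded more than once.
        """
        for child in self.children:
            yield child
            yield from child.descendants()


# A node contains its cards; a card contains its ports.
node = ManagedObject("node-1", "node")
card = node.add(ManagedObject("card-1", "card"))
card.add(ManagedObject("port-1", "port"))
card.add(ManagedObject("port-2", "port"))

contained = [obj.name for obj in node.descendants()]
```

Here `contained` lists the card followed by its two ports, mirroring the node, card, and port example in the text.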
Logical managed objects may be related to each other using an association relationship. A logical managed object instance is associated with another if the two instances, possibly together with other logical managed objects, make up a higher-level logical managed object. For example, one endpoint is associated with another endpoint if they are endpoints of the same user connection.
A user connection or segment may also be associated with another if they form a larger user connection.
A logical managed object may also be contained by physical managed objects or other logical managed objects. For example, a port (a physical managed object) contains all endpoints on the port. A user connection is contained by its endpoints. Note that the containment relationship differs between logical managed objects and physical managed objects. A physical managed object contains other physical managed objects if it is composed of those physical managed objects. A logical managed object, on the other hand, is contained by other logical managed objects if it is composed of those logical managed objects. A logical managed object is contained by a physical managed object if it is supported by the physical managed object.
Events indicate the occurrence of a monitored condition of managed objects. Events of physical managed objects are generated by physical managed objects, while events of logical managed objects are generated or derived by the management system. An event on a physical managed object may affect both the physical managed object and some logical managed objects. For example, a port failure event will affect both the physical managed object (the port) and logical managed objects (endpoints on the port and the corresponding connections).
Events have various relationships to one another. Time is one relation: event A happens before event B. Cause is another relation: event A caused event B to happen. If event A caused event B to happen, then event A must have happened before event B. In general, an event on a managed object may cause events on other managed objects contained by it to happen. For example, a failure event on a port will result in a failure event on all endpoints on the port, and subsequently, result in failure events on user connections contained by the endpoints.
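As a rough illustration of this causal chain, the following Python sketch walks a containment map breadth first to list the objects on which a failure event would follow. The map `CONTAINS`, the function `propagate_failure`, and all object names are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List

# Hypothetical containment map: each key contains the listed objects.
CONTAINS: Dict[str, List[str]] = {
    "port-1": ["endpoint-a", "endpoint-b"],
    "endpoint-a": ["connection-x"],
    "endpoint-b": ["connection-y"],
}


def propagate_failure(root: str, contains: Dict[str, List[str]]) -> List[str]:
    """Return the objects that fail when `root` fails, in causal order.

    The breadth-first walk reflects the rule that a cause is visited
    before the events it causes on contained objects.
    """
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in contains.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


affected = propagate_failure("port-1", CONTAINS)
```

The port appears first in `affected`, then its endpoints, then the user connections contained by those endpoints.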
The state of a managed object comprises self, secondary, and parent states. The parent state applies to all managed objects and is defined as the summary of the self states of all containing managed objects. Thus, for logical managed objects, the fail parent state always implies the fail self state.
The self state of a physical managed object is defined as the condition of the hardware component, while the self state of a logical managed object is fail if any physical managed object that supports the logical managed object is in the fail self state. A logical managed object may also be in the fail state for its own reasons. For example, a user connection will be in the fail state if it is not configured properly, even if all supporting (or containing) physical managed objects are in the OK state.
The secondary state applies only to logical managed objects and is defined as the summary of the self states of all associated logical managed objects.
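The three states of a logical managed object can be sketched as follows. This is a minimal illustration that assumes a "summary" is FAIL whenever any contributing state is FAIL; the text does not define the summary function precisely, and the names `State`, `summarize`, and `logical_object_state` are hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable


class State(Enum):
    OK = "OK"
    FAIL = "FAIL"


def summarize(states: Iterable[State]) -> State:
    """Assumed summary rule: FAIL if any contributing state is FAIL."""
    return State.FAIL if any(s is State.FAIL for s in states) else State.OK


def logical_object_state(
    own_condition: State,
    containing_self_states: Iterable[State],
    associated_self_states: Iterable[State],
) -> Dict[str, State]:
    """Compute parent, self, and secondary states of a logical object."""
    # Parent state: summary of the self states of all containing objects.
    parent = summarize(containing_self_states)
    # Self state: FAIL for the object's own reasons (for example, a
    # misconfiguration) or because the parent state is FAIL.
    self_state = summarize([own_condition, parent])
    # Secondary state: summary of the self states of associated objects.
    secondary = summarize(associated_self_states)
    return {"parent": parent, "self": self_state, "secondary": secondary}


# A connection whose supporting port has failed.
port_failed = logical_object_state(State.OK, [State.OK, State.FAIL], [State.OK])
# A misconfigured connection on healthy hardware.
misconfigured = logical_object_state(State.FAIL, [State.OK], [])
```

In the first case the fail parent state forces the fail self state; in the second the connection is in the fail state for its own reasons although its parent state is OK.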
Event processing is useful in management of a data and telecommunication network that uses the foregoing object model. A data and telecommunication network is a geographically distributed collection of interconnected subnetworks for transporting data between stations. An end-to-end user connection consists of segments from the subnetworks. A user connection state is defined as the aggregation of the states of its component segments. Network management in such a system is hierarchical; each subnetwork is managed independently by a separate network management system, which is responsible for the segments in that subnetwork. A global network management system is responsible for user connections across subnetworks.
In the above-described model, network elements are physical managed objects managed by element managers, while segments are logical managed objects managed by connection managers. Network elements generate events (or alarms) when monitored conditions occur. One task of element managers is to report alarms on network elements to the connection manager when segments are affected. Similarly, a connection manager in each subnetwork is responsible for generating (and reporting) alarms for segments contained by the network elements to the global connection manager (when user connections are affected). The global connection manager will then decide if and how to report the alarms to an end user of the network management system.
Event processing, or more specifically alarm suppression, is an important issue in data and telecommunication network management. Distributed systems in enterprises as well as telecommunication environments demand more automated fault management. A single fault at the element level in these complex systems might cause a huge number of symptomatic error messages and side effects to occur at all levels. The common root faults for these symptoms have to be identified to start fault removal procedures as soon as possible and to decrease system down-time.
In particular, when one or more network elements fail and the failure is later cleared, users who monitor the network should be informed about the network state change by means of an alarm. Often, when a network element fails, a large number of network alarms will be generated from all affected network elements and connection segments. Event correlation is a technique that correlates a large number of network alarms into a small number of root cause alarms.
In a large data and telecommunication network that interconnects multiple subnetworks, there is also a need for event correlation at the global network level, in addition to event correlation at the subnetwork level. For example, a port failure at one subnetwork may be detected by another subnetwork. One subnetwork reports to the global network management system a port failure affecting a number of segments, and the other subnetwork reports a secondary or A-bit failure, affecting a different set of segments. The global network management system is responsible for correlating the two alarms and, in this case, suppressing the A-bit alarm, as it is caused by the port failure at the other subnetwork.
Alarm correlation is the process by which a mass of alarms is narrowed to a root cause to report and side effects to suppress. Alarm suppression can be local to a subnetwork or can cross subnetwork boundaries. For example, a port failure at the element level in one subnetwork may be detected by other subnetworks: one subnetwork reports a port failure affecting a set of segments, and other subnetworks report so-called A-bit failures affecting different sets of segments. It is the network management system's responsibility to correlate these alarms and, in this case, suppress the A-bit alarms, as they are caused by the port failure at the other subnetwork.
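The cross-subnetwork case can be illustrated with a short Python sketch. It assumes that alarms are correlated through the user connection to which each affected segment belongs, which the text implies but does not state as a rule; the `Alarm` tuple, the `SEGMENT_TO_CONNECTION` map, and the alarm kinds `PORT_FAIL` and `A_BIT` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Set


class Alarm(NamedTuple):
    subnetwork: str
    kind: str      # "PORT_FAIL" or "A_BIT" (illustrative alarm kinds)
    segment: str


# Hypothetical mapping from segments to the user connection they form.
SEGMENT_TO_CONNECTION: Dict[str, str] = {
    "seg-a1": "conn-1",
    "seg-b1": "conn-1",
    "seg-b2": "conn-2",
}


def correlate(alarms: List[Alarm], seg_to_conn: Dict[str, str]) -> List[Alarm]:
    """Drop A-bit alarms explained by a port failure in another subnetwork."""
    # Connection -> subnetworks that reported a port failure on it.
    port_failures: Dict[str, Set[str]] = {}
    for alarm in alarms:
        if alarm.kind == "PORT_FAIL":
            conn = seg_to_conn[alarm.segment]
            port_failures.setdefault(conn, set()).add(alarm.subnetwork)

    reported: List[Alarm] = []
    for alarm in alarms:
        if alarm.kind == "A_BIT":
            sources = port_failures.get(seg_to_conn[alarm.segment], set())
            if sources - {alarm.subnetwork}:
                continue  # side effect of a failure elsewhere: suppress
        reported.append(alarm)
    return reported


alarms = [
    Alarm("B", "A_BIT", "seg-b1"),      # side effect of the failure in A
    Alarm("A", "PORT_FAIL", "seg-a1"),  # root cause
    Alarm("B", "A_BIT", "seg-b2"),      # unrelated connection: kept
]
reported = correlate(alarms, SEGMENT_TO_CONNECTION)
```

This batch version sees all alarms at once, so it sidesteps the ordering problem that arises when the A-bit alarm arrives before the port failure alarm.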
Different approaches to this problem have been proposed. Various approaches are based on state machines, rule processing, and codebooks. A number of commercial systems based on these technologies are also available, e.g., the Event Correlation Service of Hewlett-Packard OpenView, Event Correlation Solutions offered by Lucent Technologies, Inc., NerveCenter from Seagate, and InCharge from SMARTS. Further information about these systems is set forth in Kent Sheers, HP OpenView Event Correlation Services, Hewlett-Packard Journal, October 1996; Lucent Technologies, Event Correlation Solution, document 5683FS.pdf at the Lucent Web site; Seagate Corp., Enterprise Event Automation with Seagate NerveCenter, a white paper available from Seagate; Shaula Yemini, et al., High Speed and Robust Event Correlation, IEEE Communications Magazine, May 1996.
Techniques have been developed and widely used to correlate events at the network element level. For example, when a line fails, all ports on the line also fail. However, all network alarm events about the port failures will be suppressed, as they are all caused by the line failure.
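Element-level suppression of this kind can be sketched in Python as follows. The `PARENT` map and the function `root_cause_alarms` are hypothetical; the rule used, suppressing an alarm when any containing object is also in alarm, is a simplification chosen for illustration rather than the method of any particular product.

```python
from typing import Dict, List

# Hypothetical containment map: each port points to the line that carries it.
PARENT: Dict[str, str] = {
    "port-1": "line-1",
    "port-2": "line-1",
    "port-3": "line-2",
}


def root_cause_alarms(alarmed: List[str], parent: Dict[str, str]) -> List[str]:
    """Keep only alarms on objects that have no alarmed ancestor."""
    alarmed_set = set(alarmed)
    reported: List[str] = []
    for obj in alarmed:
        # Walk up the containment chain looking for an alarmed ancestor.
        ancestor = parent.get(obj)
        while ancestor is not None and ancestor not in alarmed_set:
            ancestor = parent.get(ancestor)
        if ancestor is None:
            reported.append(obj)  # no containing object explains this alarm
    return reported


reported = root_cause_alarms(["port-1", "line-1", "port-2", "port-3"], PARENT)
```

The alarms on `port-1` and `port-2` are suppressed because `line-1` has failed, while the alarm on `port-3` is reported because `line-2` is not in alarm.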
General event correlation techniques have also been developed that allow users to define their own event correlation logic at a higher level, such as in the case of a subnetwork that is managed by a single network management system. An example of such an approach is the Hewlett-Packard OpenView Event Correlation Service. A general event correlation system can be used for event correlation at the global network management level.
Unfortunately, these systems have numerous drawbacks. For example, while they are powerful and are designed to handle all kinds of alarms at different levels of abstraction, they offer too much power and overhead for some management situations. They are complex and thus difficult to use, requiring significant customization, e.g., writing event correlation rules, building a behavior model for a state machine, or designing event models of managed objects. They are computationally expensive to use and incur significant runtime overhead. For example, whether they rely on rule evaluation or on minimal distance decoding, they must carry out database operations and/or complex computation at run time.
In other cases, an alarm may arrive but there may be insufficient information to determine whether it should be suppressed. In these cases, prior approaches may suppress the alarm anyway, even though later information may indicate that the alarm should not be suppressed.
Based on the foregoing, there is a clear need in this field for a simple and accurate alarm suppression method and system.
Further, there is a need for a method or system of alarm suppression that can determine when a particular alarm is likely to be a side-effect alarm, but that can suspend action on that alarm until receiving additional information that confirms that the alarm is a side-effect alarm that should be suppressed.
A method and apparatus are disclosed for suppressing side-effect alarms that arrive out of order in a network communication system, based on state changes and the alarm reporting history of logical managed objects, such as user connections. State information is maintained for each of a plurality of interested logical managed objects that represent user connections, comprising parent object state, primary state, and secondary state. The parent object state is OK if all parent objects (lines, ports, etc.) of the connection are functioning properly, and FAIL otherwise. The primary connection state is OK if the entire connection is functioning properly. The secondary state or A-bit state is FAIL if a failure at one subnetwork is detected by other subnetworks. The system also maintains information indicating the last generated alarm for each interested logical managed object. A new state of each interested logical managed object is computed when alarms on its containing physical managed objects or associated logical managed objects have been reported. The method then decides whether to report or suppress the alarms, based on a lookup operation using a decision table. If an alarm is suspected of being a side-effect alarm, based on selected conditions, alarm information is placed in a queue to await the arrival of a second, related alarm that confirms that the first alarm was a side-effect alarm, and the side-effect alarm is then suppressed.
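The table lookup and the holding queue can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the entries of `DECISION_TABLE`, the action names `REPORT`, `SUPPRESS`, and `HOLD`, the default action for unlisted keys, and the `PendingAlarms` class are all hypothetical, since this summary does not enumerate the actual table contents or queue structure.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Key: (parent state, primary state, secondary state, last generated alarm).
# The entries are illustrative assumptions, not the actual decision table.
DECISION_TABLE: Dict[Tuple[str, str, str, str], str] = {
    ("FAIL", "FAIL", "OK", "NONE"): "REPORT",    # new root-cause failure
    ("FAIL", "FAIL", "OK", "FAIL"): "SUPPRESS",  # failure already reported
    ("OK", "OK", "FAIL", "NONE"): "HOLD",        # suspected side effect
    ("OK", "OK", "OK", "FAIL"): "REPORT",        # failure has cleared
}


def decide(parent: str, primary: str, secondary: str, last_alarm: str) -> str:
    """Look up the action; unlisted keys default to REPORT (an assumption)."""
    return DECISION_TABLE.get((parent, primary, secondary, last_alarm), "REPORT")


class PendingAlarms:
    """Queue of suspected side-effect alarms awaiting confirmation."""

    def __init__(self) -> None:
        self._queue: Deque[Tuple[str, str]] = deque()  # (connection, alarm)

    def hold(self, connection: str, alarm: str) -> None:
        self._queue.append((connection, alarm))

    def confirm(self, connection: str) -> List[Tuple[str, str]]:
        """Remove and return held alarms confirmed as side effects."""
        suppressed = [item for item in self._queue if item[0] == connection]
        self._queue = deque(i for i in self._queue if i[0] != connection)
        return suppressed

    def __len__(self) -> int:
        return len(self._queue)


pending = PendingAlarms()
# The A-bit alarm arrives first, before the port failure that caused it.
action = decide("OK", "OK", "FAIL", "NONE")
if action == "HOLD":
    pending.hold("conn-1", "A-BIT")
# The related port failure alarm arrives later and confirms the side effect.
suppressed = pending.confirm("conn-1")
```

Holding the alarm rather than suppressing it immediately keeps open the possibility of reporting it if no confirming alarm ever arrives; how long to wait is not addressed in this sketch.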