1. Field of the Invention
This invention relates generally to communications networks, and more particularly, to communications networks having multiple domains, each of which may cause intra-domain alarms. These intra-domain alarms may be correlated to provide inter-domain alarms and to facilitate more effective user notification and corrective action.
2. Discussion of the Related Art
Computer networks are widely used to provide increased computing power, sharing of resources and communication between users. Networks may include a number of computer devices within a room, building or site that are connected by a high-speed local data link such as token ring, Ethernet, or the like. Local area networks (LANs) in different locations may be interconnected by, for example, packet switches, microwave links and satellite links to form a wide area network (WAN). A network may include several hundred or more connected devices, distributed across several geographical locations and belonging to several organizations.
Many existing networks are so large that a network administrator will partition the network into multiple domains for ease of management. There are various types of domains. One example is based on geographical location. For example, a company may own or manage a network that includes a first domain geographically located in a first city and a second domain geographically located in a second city, as well as other domains disposed in other geographical locations.
Another domain type is based on organization or departments, e.g., accounting, engineering, sales, etc. A company may have a computer network spanning multiple organizations and multiple geographical locations, but there may not be a one-to-one mapping of organizations to geographical locations. Thus, a first organization and a second organization may both share network resources within first and second geographical locations. For purposes of network accounting (e.g., to allocate network charges to the appropriate organization) or for other reasons, it may be advantageous to consider the network resources of the first organization as being a separate domain from the network resources of the second organization.
A third example of a domain type is a grouping based upon functional characteristics of network resources. For example, one functional domain may be considered to be network resources belonging to a company that are provided for performing computer-aided design, which may draw upon common databases and have similar network traffic. Another functional domain may be network resources of the same company that are provided for financial analysis, which may be resources specially adapted to provide financial data. The network resources of these two domains may be distributed across several geographical locations and several organizations of the company. However, it may be desirable for a network administrator to group the computer-aided design network resources into one domain and to group the financial analysis network resources into another domain. Additional examples of communication network domains also exist, and a single company or organization may have domains that fall into several categories.
The above examples were discussed with respect to one company owning and managing its own network. Similar situations exist for any entity that manages and/or owns a network, for example a service company that provides network management services to several companies.
In the operation and maintenance of computer networks a number of issues arise, including traffic overload on parts of the network, optimum placement and interconnection of network resources, security, isolation of network faults, and the like. These issues become increasingly complex and difficult to understand and manage as the network becomes larger and more complex. For example, if a network device is not sending messages, it may be difficult to determine whether the fault is in the device itself, a data communication link, or an intermediate network device between the sending and receiving devices.
Network management systems are intended to resolve such issues. Older management systems typically operated by collecting large volumes of information, which then had to be evaluated by a network administrator. Such systems thus placed a tremendous burden on the network administrator and required that the administrator be highly skilled.
Newer network management systems systematize the knowledge of the networking expert such that common problems of a single domain (i.e., a portion of the network under common management) can be detected, isolated and repaired, either automatically or with the involvement of less-skilled personnel. Such a system typically includes a graphical representation of that portion of the network being monitored by the system. Alarms are generated to inform an external entity that an event has occurred or requires attention. Since a large network may have many such events occurring simultaneously, some network management systems provide alarm filtering (i.e., only certain events generate an alarm).
Commercially available network management systems and applications for alarm filtering include: (1) SPECTRUM®, Cabletron Systems, Inc., 35 Industrial Way, Rochester, N.H. 03867; (2) HP OpenView, Hewlett Packard Corp., 3000 Hanover Street, Palo Alto, Calif. 94304; (3) LattisNet, Bay Networks, 4401 Great American Pkwy., Santa Clara, Calif. 95054; (4) IBM Netview/6000, IBM Corp., Old Orchard Road, Armonk, N.Y. 10504; (5) SunNet Manager, SunConnect, 2550 Garcia Ave., Mountain View, Calif. 94043; and (6) NerveCenter, NetLabs Inc., 4920 El Camino Real, Los Altos, Calif. 94022.
However, in each instance the existing network management system manages only a single domain. For example, a company having a network consisting of several domains will typically purchase one copy of a network management system for each domain. Each copy of the network management system may be referred to as an instance. Thus, in the functional domain example described above, a first instance of a network management system may manage the computer-aided design domain, while a second instance of a network management system may manage the financial analysis domain. Each instance of the network management system receives information only from the resources of a single respective domain, and generates alarms that are specific only to the single respective domain. Such alarms may be referred to as intra-domain alarms.
Because each instance of a network management system manages only one domain, there is currently no diagnosis or management which takes into account the relationships among multiple domains. Since domains may be interconnected, an intra-domain alarm might be generated for a first domain, even though the event or fault that is causing the intra-domain alarm may be contained within the network resources of a different domain. For example, a first domain in a network may include a router that forwards network traffic to a resource in a second domain. If the router fails or begins to degrade, the performance of the second domain may appear sluggish (e.g., excessive delays, low throughput), even though the network resources within the second domain are operating correctly. This sluggishness may cause an alarm to be generated from the instance of the network management system that manages the second domain. However, no alarm relating to this situation has been generated by the first instance of the network management system that manages the first domain, because there is no performance degradation within the first domain. It is currently necessary to apply human intervention and human reasoning to resolve such a situation.
According to one aspect of the invention, a multi-domain alarm manager provides alarm correlation among a plurality of domains in a communications network. Individual network management systems each monitor a single respective domain of the communications network, and provide intra-domain alarms indicative of status specific to the single respective domain. The manager receives the intra-domain alarms, and correlates them to provide inter-domain alarms as well as responses in the form of corrective actions. The manager thus provides a high level of correlation and response for the entire network while each single-domain network management system provides a lower level of correlation and response for an individual domain of the network.
According to a method embodiment of the invention, the method comprises the steps of receiving a first intra-domain alarm from a first domain, receiving a second intra-domain alarm from a second domain, and correlating the first alarm with the second alarm to generate an inter-domain alarm.
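The three steps of this method embodiment can be illustrated with a minimal Python sketch. The class names, the `correlate` function, and the rule used here (alarms from different domains that arrive within a fixed time window are treated as related) are hypothetical choices made only for illustration; the text above does not specify a correlation rule.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IntraDomainAlarm:
    """Alarm reported by one single-domain network management system."""
    domain: str
    description: str
    timestamp: float  # seconds; the time base is an assumption


@dataclass(frozen=True)
class InterDomainAlarm:
    """Alarm produced by correlating alarms from two different domains."""
    domains: Tuple[str, str]
    description: str


# Hypothetical correlation window; not taken from the source text.
WINDOW_SECONDS = 60.0


def correlate(first: IntraDomainAlarm,
              second: IntraDomainAlarm) -> Optional[InterDomainAlarm]:
    """Return an inter-domain alarm if the two alarms appear related."""
    if first.domain == second.domain:
        return None  # same domain: nothing inter-domain to report
    if abs(first.timestamp - second.timestamp) > WINDOW_SECONDS:
        return None  # too far apart in time under the assumed rule
    return InterDomainAlarm(
        domains=(first.domain, second.domain),
        description=f"{first.description}; {second.description}",
    )


# Receive a first and a second intra-domain alarm, then correlate them.
alarm_1 = IntraDomainAlarm("domain_1", "router degraded", 100.0)
alarm_2 = IntraDomainAlarm("domain_2", "low throughput", 130.0)
inter_alarm = correlate(alarm_1, alarm_2)
```

A time window is only one possible criterion; the embodiments described elsewhere in this section also mention domain adjacency and a state machine.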
In the above embodiments, the inter-domain alarm may be analyzed to determine a corrective action, and a command may be provided to a network resource within the communications network to implement the corrective action. Moreover, status information may be received from at least one resource in the first domain, and a first portion of the status information may be correlated to generate a first intra-domain alarm. A second portion of the status information may be correlated to determine a second corrective action, and a second command may be provided to a second network resource within the communications network to implement the second corrective action.
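As a hedged illustration of analyzing an inter-domain alarm to determine a corrective action and issue a command, the following Python sketch uses a simple lookup table. The condition names, action names, and the `determine_command` helper are all hypothetical; the text does not state how the analysis is performed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class InterDomainAlarm:
    condition: str          # hypothetical condition label
    suspect_resource: str   # resource believed to be the cause


@dataclass(frozen=True)
class Command:
    target_resource: str
    action: str


# Hypothetical mapping from alarm condition to corrective action.
CORRECTIVE_ACTIONS: Dict[str, str] = {
    "upstream_router_degraded": "reroute_traffic",
    "link_down": "activate_backup_link",
}


def determine_command(alarm: InterDomainAlarm) -> Optional[Command]:
    """Pick a corrective action for the alarm, or None if none is known."""
    action = CORRECTIVE_ACTIONS.get(alarm.condition)
    if action is None:
        return None
    # The command is addressed to the network resource that should act.
    return Command(target_resource=alarm.suspect_resource, action=action)


command = determine_command(
    InterDomainAlarm("upstream_router_degraded", "router_7")
)
```

Returning `None` for an unrecognized condition leaves room for escalation to a human operator, which is a design choice of this sketch rather than something the text prescribes.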
In a particular embodiment, correlating includes determining a domain that is adjacent to a first domain. Correlating may also include determining a severity of a condition indicated by a combination of a first intra-domain alarm and a second intra-domain alarm, the inter-domain alarm including an indication of the severity of the condition. Additionally, correlating may include providing a first intra-domain alarm and a second intra-domain alarm to a state machine, and receiving an output from the state machine indicative of a severity and a correlation of a combination of the first intra-domain alarm and the second intra-domain alarm.
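The adjacency check and the state machine described in this embodiment might look like the following Python sketch. The domain names, the adjacency table, the state names, and the severity labels are assumptions introduced for illustration; the text says only that the state machine outputs an indication of severity and correlation.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical adjacency table: which domains are directly interconnected.
ADJACENCY: Dict[str, Set[str]] = {
    "domain_a": {"domain_b"},
    "domain_b": {"domain_a", "domain_c"},
    "domain_c": {"domain_b"},
}


def is_adjacent(first: str, second: str) -> bool:
    """True if the second domain is adjacent to the first."""
    return second in ADJACENCY.get(first, set())


# Output per state: (severity, correlated). Labels are assumptions.
OUTPUTS: Dict[str, Tuple[str, bool]] = {
    "idle": ("none", False),
    "single": ("minor", False),
    "uncorrelated": ("minor", False),
    "correlated": ("critical", True),
}


class AlarmStateMachine:
    """Tracks intra-domain alarms and reports severity and correlation."""

    def __init__(self) -> None:
        self.state = "idle"
        self.first_domain: Optional[str] = None

    def feed(self, domain: str) -> Tuple[str, bool]:
        """Consume an intra-domain alarm from the given domain."""
        if self.state == "idle":
            self.first_domain = domain
            self.state = "single"
        elif self.state == "single" and domain != self.first_domain:
            # A second domain has alarmed: correlate only if adjacent.
            if is_adjacent(self.first_domain, domain):
                self.state = "correlated"
            else:
                self.state = "uncorrelated"
        return OUTPUTS[self.state]


machine = AlarmStateMachine()
after_first = machine.feed("domain_a")
after_second = machine.feed("domain_b")
```

In this sketch a single alarm is reported as minor, and the severity rises only when a second alarm arrives from an adjacent domain.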
By providing two levels of correlation, one at the network management system level (within a domain), which utilizes, for example, model-based reasoning, and a second at the alarm manager level (across multiple domains), which utilizes, for example, a state transition graph (or case-based reasoning or intelligent systems), the invention provides an improvement in scalability that is not possible with prior systems.