This is the first application filed for the present invention.
Not applicable.
The present invention relates to Network Management Systems, and in particular to a method and system for enabling reliable network fault monitoring in an inherently unreliable network transport environment.
The conventional network space comprises a layered architecture of a network transport fabric comprising Network Elements (NEs) (e.g., switches, routers etc.) for end-to-end transport of payload data across the network, and a network management layer for controlling operation of the NEs and providing network administrative services.
A typical network management model includes: Management Stations; a Management Information Base (MIB); Management Agents; and a Management Protocol.
Management Stations are also known as network managers, and may comprise stand-alone devices and/or a distributed platform which communicate with one or more Management Agents. Management Stations typically have a set of management applications for monitoring, analyzing and presenting management data. They may also provide a user interface and access point for human operators.
A Management Information Base (MIB) is a collection of managed objects. Each MIB object is generally defined as a data variable representing network resources, resource components, as well as their respective attributes, status and performance statistics. MIBs represent the data model of the network, and typically provide an open interface for multi-vendor inter-operability.
Management Agents typically implement the MIB for the managed resources in their context, and support the required protocol interactions with the Management Stations. These agents may also serve as proxies for devices that do not have the capability to support the standard protocol suite.
The Management Protocol specifies interaction models between the Management Stations and the Management Agents via operation directives and notification mechanisms. This includes predefined message sets exchanged between a manager and an agent.
Within the above-described network management model, the Management Stations are conveniently divided into Network Management Systems (NMSs), and Element Management Systems (EMSs). Each EMS is connected to one or more NEs, and operates to manage the operation of the NEs within its domain. Each EMS interfaces with an NMS which operates to provide end-to-end network administration and management functionality (including, where applicable, user interfaces for human operators).
Currently, three major standards organizations are working on standards for network management systems: the Internet Engineering Task Force (IETF); Open Systems Interconnection (OSI); and the International Telecommunications Union-Telecommunications Standards Sector (ITU-TS). The standard adopted by the IETF is the Simple Network Management Protocol (SNMP). SNMP is designed for enterprise data communications networks, and its flexibility and simplicity make it the most popular standard implemented in such networks. The OSI and ITU-TS are each working on a standard called the Common Management Information System (CMIS). CMIS is an object-oriented network/system management solution with well-defined managed object information and is recommended as a solution for carrier-grade network management.
SNMP is a set of standards for network management that includes: a Management Protocol; a MIB specification methodology; and administrative control to handle manager-agent interactions. SNMP resides at the application layer of the OSI model and is typically implemented over an unreliable transport service, namely the User Datagram Protocol (UDP), which is a connectionless protocol over Internet Protocol (IP). SNMP has undergone a number of revisions to provide functional enhancements. For example, SNMP v2c enhances the Structure of Management Information (SMI), offers manager-to-manager notification capability, and defines powerful protocol operations and an elaborate set of return codes. SNMP v3 augments SNMP v2 by introducing a security and administration framework.
As mentioned above, UDP is a connectionless protocol over IP, so delivery of SNMP notifications transmitted between an EMS and an NMS over UDP cannot be guaranteed. This inherent unreliability of the network signaling environment precludes carrier-grade reliable network management.
Accordingly, there is a need for systems for enabling carrier-grade reliable network management in an inherently unreliable network transport environment.
Network management includes the following five functional areas:
1) Fault management;
2) Performance management;
3) Accounting;
4) Configuration; and
5) Security management.
Each functional area includes many related management functions. One important function of fault management is fault monitoring. The fault monitoring function detects the failure of systems to meet their operational objectives. Fault monitoring is the basis for further fault diagnosis and correction. Fault monitoring is always important, especially in a carrier-grade network. A carrier-grade fault monitoring system must conform to a few basic criteria:
a) 100% Reliability: Any method and system designed for achieving carrier-grade network management should provide 100% reliability in collecting and receiving network fault information.
b) Synchronization: The monitoring system must define a procedure to keep the monitoring system and the monitored system in synchronization with respect to the fault information at a given point. Synchronization includes:
  a. initial startup synchronization
  b. lose/regain communication synchronization
  c. continuous out-of-synchronization recovery
c) Sequence: To avoid corrupting the integrity of alarm information, it is generally necessary to process the alarm information in time sequence. The managed system should send alarm information in time sequence, and the management system should likewise process alarm events in time sequence.
d) Timeliness: The mechanism should permit the recovery of lost alarm information in a timely fashion (within the tolerance of network management requirements).
e) Efficiency: The network traffic involved in achieving reliable fault monitoring should be kept as low as possible. Generally, the network management traffic should not consume more than about 5% of network capacity under normal conditions.
f) Standards Based Open Interface: The interface defined and employed by the system should adhere to certain standards to achieve the maximum openness.
Due to the recent convergence of data communications and telecommunications, as well as the high cost of CMIS, network administrators have begun to use SNMP to manage carrier-grade networks. As mentioned above, SNMP offers great flexibility and simplicity. To achieve that flexibility and simplicity, SNMP does not standardize what should be defined in the MIB.
However, OSI/ITU-T standards specify useful management information that is appropriate for carrier grade fault monitoring. There therefore exists a need for an SNMP MIB that includes key management data to provide a richer data model that is more functional and useful for reliable fault monitoring. As mentioned above, SNMP is typically implemented over UDP, which offers no transport service guarantees, and this inherent unreliability challenges carrier-grade fault monitoring.
There therefore exists a need for an innovative solution for defining the required MIB data and specifying the expected behavior of the application layer protocol engines of the management system, so as to ensure accurate data synchronization under various network conditions.
An object of the present invention is to provide a system for enabling carrier-grade reliable fault monitoring using a simple, inherently unreliable management protocol such as SNMP by incorporating the surveillance data specified by the OSI/ITU-TS standards for the MIB definition, as well as a mechanism for ensuring reliability.
Accordingly, an aspect of the present invention provides a method of enabling reliable network fault monitoring using an unreliable network management protocol, such as SNMP. The method includes the steps of: receiving notifications sent over the unreliable network transport environment, each notification having a unique transmitted notification sequence number (TxNSN); detecting a missing notification on the basis of the respective TxNSNs of received notifications; and sending a polling request for the missing notification.
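By way of illustration only, the detection step of the method may be sketched as follows. The function name missing_txnsns is hypothetical, and the sketch assumes that TxNSNs are plain, non-wrapping integers that increase by one per notification; neither assumption is stated in the text above.

```python
def missing_txnsns(received):
    """Return the TxNSNs absent between the lowest and highest received values.

    Assumption: sequence numbers are non-wrapping integers that increase
    by exactly one per notification sent.
    """
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Each number returned would become the subject of a polling request.
gaps = missing_txnsns([1, 2, 4, 7])
```

In this sketch the arrival order of the notifications does not matter; only the set of TxNSNs seen so far is used to decide which notifications need to be polled for.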
Another aspect of the invention provides guidelines for converting a well-defined network management information model into a MIB for use in a simple network management protocol, and for using the information in management operations between the management system and the managed system. The managed objects and their container relationships defined in the OSI/ITU-TS standards are captured and stored in the simple network management protocol MIB, or sent along with the notifications. A subset of the object attributes defined in the OSI/ITU-TS standards is likewise captured and stored in the simple network management protocol MIB, or sent along in the notifications.
Another aspect of the present invention provides a system for enabling carrier-grade network fault monitoring in an unreliable network transport environment. The system includes a first manager which is an Element Management System (EMS), and a second manager which is a Network Management System (NMS). The first manager is operatively connected for bi-directional communication over the unreliable network transport environment. The first manager collects and stores management information (objects and their attributes) in the MIB. The first manager is adapted to formulate and send notifications over the unreliable network transport environment, each notification including the required attributes, and having a unique transmitted notification sequence number (TxNSN). The second manager is operatively connected for bi-directional communication with the first manager over the unreliable network transport environment. The second manager comprises: detection means responsive to notifications received from the first manager for detecting a missing notification on the basis of the respective TxNSNs of received notifications; polling means responsive to detection of a missing notification for sending a polling request to the first manager for retrieving data from the missing notification; and synchronization means for initial and continuing fault information synchronization with the first manager.
A further aspect of the invention provides a manager for enabling reliable management in an unreliable network transport environment in which the manager comprises an interface operatively connected for reception of management data from a managed resource within a management domain of the manager. A notification entity is responsive to the received management data and formulates a notification indicative of the received management data. The notification includes a respective unique transmitted notification sequence number.
A still further aspect of the invention provides a manager for enabling reliable management in an unreliable network transport environment in which the manager comprises synchronization means for initial synchronization with the managed system; and detection means for detecting notifications received over the unreliable network transport environment. Each notification includes a respective unique transmitted notification sequence number (TxNSN). The detection means is adapted to detect a missing notification on the basis of the respective TxNSNs of the received notifications. The manager further comprises polling means responsive to detection of a missing notification for sending a polling request for the missing notification; and polling means for detecting communications loss and re-establishment of operations and management (OAM) communications, and for sending appropriate requests for overall management data re-synchronization.
In one embodiment of the invention, the first manager comprises an interface operatively connected for reception of management data from a managed resource within a management domain of the first manager. A notification entity responsive to the received management data formulates a notification indicative of the received management data. Preferably, the notification entity is responsive to the management data and formulates a notification corresponding to a selected one of a set of predetermined notification types. In a preferred embodiment, the set of predetermined notification types comprises any one or more of: Enrol Notifications; De-enrol Notifications; State Change Notifications; Attribute Change Notifications; and Alarm Notifications.
In another embodiment of the invention, the first manager further comprises a first management information base that includes a current notification sequence number; and a notification log. Preferably, the first management information base further includes information respecting one or more of: an identity of a managed resource within the management domain; a state of the managed resource; and alarm notifications sent by the first manager.
The notification entity preferably increments the current notification sequence number to a next higher value after assigning a notification sequence number to a TxNSN of a notification.
In an embodiment of the invention, following transmission of a notification to the second manager, the first manager is adapted to back-up contents of the transmitted notification in the notification log.
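By way of illustration only, the behavior of the notification entity described above (assigning the current notification sequence number as the TxNSN, incrementing it, transmitting, and then backing up the notification in the notification log) may be sketched as follows. The class and field names, the starting value of 1, and the send callable standing in for the unreliable transport are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Notification:
    txnsn: int      # transmitted notification sequence number
    kind: str       # e.g. "Alarm" or "StateChange" (illustrative labels)
    payload: dict


@dataclass
class NotificationEntity:
    send: Callable[[Notification], None]      # hypothetical unreliable transport
    current_nsn: int = 1                      # assumed starting value
    log: dict = field(default_factory=dict)   # notification log, keyed by TxNSN

    def notify(self, kind, payload):
        # Assign the current sequence number, then advance to the next higher value.
        note = Notification(self.current_nsn, kind, payload)
        self.current_nsn += 1
        # Transmit first; delivery is not guaranteed by the transport.
        self.send(note)
        # Back up the transmitted notification so it can be polled for later.
        self.log[note.txnsn] = note
        return note

    def poll(self, txnsn):
        # Answer a recovery poll from the log; None if the entry is not held.
        return self.log.get(txnsn)


sent = []
ems = NotificationEntity(send=sent.append)
first = ems.notify("Alarm", {"resource": "ne-1"})
second = ems.notify("StateChange", {"resource": "ne-1"})
```

Because every transmitted notification remains available in the log under its TxNSN, a receiver that detects a gap can retrieve the lost data by polling rather than relying on retransmission by the transport.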
The interface of the first manager is preferably adapted to detect a plurality of predetermined alarm events, and store the alarm events in the MIB. The first manager preferably further comprises a buffer for temporarily storing notifications sent over the unreliable network transport.
The second manager preferably further comprises a second management information base including: a last processed notification sequence number; information respecting an identity of managed resources within a domain of the first management system; and, a state of each managed resource. Preferably, the second management information base further comprises information respecting alarms raised by the first manager.
The second manager is preferably further adapted to process a received notification if its TxNSN is the next consecutive value after the last processed notification sequence number, that is, exactly one greater. Upon processing the notification, the second manager preferably increments the last processed notification sequence number to the next larger consecutive value. Preferably, the second manager is further adapted to discard a received notification if its TxNSN is less than or equal to the value of the last processed notification sequence number. If the TxNSN is greater than the value of the last processed notification sequence number by more than one, the second manager is further adapted to initiate recovery polls to retrieve the data from the lost notifications.
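By way of illustration only, the three cases just described (process, discard, recover) may be sketched as follows. The class name, the poll and process callables, and the returned status strings are hypothetical. The sketch further assumes that the notification which revealed the gap is processed once the lost data has been recovered, and that an unanswered poll leaves the remaining work for a later retry; the text above does not specify either point.

```python
class NotificationReceiver:
    """Second-manager side: applies the sequence-number rules to each arrival."""

    def __init__(self, last_processed, poll, process):
        self.last_processed = last_processed
        self.poll = poll          # hypothetical: TxNSN -> data, or None if unavailable
        self.process = process    # hypothetical handler, called as process(txnsn, data)

    def receive(self, txnsn, data):
        if txnsn <= self.last_processed:
            return "discarded"    # duplicate or stale notification
        outcome = "processed"
        # Recover every skipped number first so events are handled in sequence.
        for missing in range(self.last_processed + 1, txnsn):
            recovered = self.poll(missing)
            if recovered is None:
                return "pending"  # assumption: retry later rather than skip ahead
            self._accept(missing, recovered)
            outcome = "recovered"
        self._accept(txnsn, data)
        return outcome

    def _accept(self, txnsn, data):
        self.process(txnsn, data)
        self.last_processed = txnsn


sender_log = {1: "a", 2: "b", 3: "c", 4: "d"}   # stands in for the sender's log
handled = []
rx = NotificationReceiver(0, sender_log.get, lambda n, d: handled.append(n))
results = [rx.receive(1, "a"), rx.receive(1, "a"), rx.receive(4, "d")]
```

Processing recovered data before the notification that exposed the gap keeps alarm events in time sequence, which is the ordering criterion set out earlier.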
During either of a start-up or a restart operation of the first manager, the notification entity is preferably adapted to formulate a cold-start notification and transmit the cold-start notification to the second manager. The second manager is also adapted to detect restarts by querying the sysUpTime variable, and on detecting a restart in this way it exhibits the same behavior as on receipt of a cold-start notification. Restart recovery procedures are therefore not based solely on an unreliable cold-start notification.
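By way of illustration only, restart detection by polling the uptime variable may be sketched as follows. The sketch assumes that the polled value counts upward from zero after each restart, so that a value lower than the previously polled one indicates a restart; the class name is hypothetical, and counter wrap-around, which would also produce a lower value, is not handled.

```python
class RestartDetector:
    """Flags a restart when the polled uptime goes backwards."""

    def __init__(self):
        self.previous = None   # no sample seen yet

    def observe(self, uptime_ticks):
        # A lower value than last time means the counter was reset (assumed: by a restart).
        restarted = self.previous is not None and uptime_ticks < self.previous
        self.previous = uptime_ticks
        return restarted


detector = RestartDetector()
samples = [detector.observe(t) for t in (1000, 2500, 300, 900)]
```

A check of this kind works even when the cold-start notification itself is lost in transit, since it depends only on values the second manager polls for.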
During either a start-up operation of the second manager, or recovery of communications between the second manager and the first manager, the second manager is further adapted to control the polling means for sending a polling request to the first manager requesting the value of the current notification sequence number. Upon reception of the requested value from the first manager, the second manager initializes the last processed notification sequence number to equal the value of the current notification sequence number. Preferably, following initialization of the last processed notification sequence number, the second manager is adapted to control the polling means for sending polling requests to the first manager requesting transmission of data extracted from the first management information base. The second manager updates its local management information with the information contained in subsequently received response messages containing the requested data from the first management information base.
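By way of illustration only, the start-up and communications-recovery synchronization may be sketched as follows. The function name and the two polling callables are hypothetical stand-ins for the polling means. The sketch follows the text literally in setting the last processed number equal to the polled current number; whether an offset of one is needed depends on whether the current number denotes the last value assigned or the next value to be assigned, which the passage above does not settle.

```python
def synchronize(poll_current_nsn, poll_mib):
    """Return (last_processed, local_mib) after a full re-synchronization.

    poll_current_nsn: hypothetical callable returning the first manager's
                      current notification sequence number.
    poll_mib:         hypothetical callable returning the first manager's
                      MIB contents as a mapping.
    """
    # Step 1: fix the sequence baseline before reading any data, so that
    # notifications sent during step 2 are not mistaken for stale ones.
    last_processed = poll_current_nsn()
    # Step 2: copy the polled MIB data into the local management information.
    local_mib = dict(poll_mib())
    return last_processed, local_mib


remote_mib = {"ne-1": "enabled", "ne-2": "disabled"}
last, local = synchronize(lambda: 42, lambda: remote_mib)
```

Reading the sequence number before the MIB data means that any change occurring during the data poll arrives afterwards as a notification with a higher TxNSN and is applied on top of the polled snapshot.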