The information-communication industry is an essential element of today's society, which is relied upon heavily by most companies, businesses, agencies, educational institutions, and other entities, including individuals. As a result, information service providers such as telephone, cable, and wireless carriers, Internet Service Providers (ISPs) and utility companies all have the need to deploy effective systems suitable for servicing such a demand. Accordingly, network management and operations have become crucial to the competitiveness of communication companies, utilities, banks and other companies operating Wide Area Networks (WANs) of computer devices and/or other network types and devices, including SONET, Wireline, Mobile, Internet Protocol (IP) devices, etcetera. For instance, many companies currently use customized “legacy” network management systems (NMSs) and operations support systems (OSSs). Various implementations of NMSs/OSSs are available in the prior art for managing networks and network elements.
Thus, management systems (“MSs,” which encompass both NMSs and OSSs) have been implemented in the prior art for managing communication networks and network elements. Given that it is often desirable to manage various network elements (e.g., various types of devices, including without limitation routers, switches, computer equipment, etcetera), various types of management systems have been developed for managing such elements. Further, because different types of network elements may communicate using different protocols, management systems may utilize different processes for managing different types of network elements. For instance, processes that may be referred to as “gateway” processes are sometimes implemented in management systems for managing particular types of network elements. As an example, a Simple Network Management Protocol (SNMP) gateway process may be implemented for managing SNMP devices, and a Common Management Information Protocol (CMIP) gateway process may be implemented for managing CMIP devices. Thus, one or more gateway processes may be implemented for managing network elements that communicate in a particular communication protocol.
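By way of illustration only, the assignment of network elements to protocol-specific gateway processes might be sketched as follows. The names `NetworkElement`, `Gateway`, and `assign_elements` are hypothetical and are not drawn from any particular prior art system; the sketch assumes one gateway per protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class NetworkElement:
    name: str
    protocol: str  # e.g. "SNMP" or "CMIP"


@dataclass
class Gateway:
    protocol: str
    elements: List[NetworkElement] = field(default_factory=list)

    def manage(self, element: NetworkElement) -> None:
        # A gateway only manages elements that speak its protocol.
        if element.protocol != self.protocol:
            raise ValueError(f"{element.name} does not use {self.protocol}")
        self.elements.append(element)


def assign_elements(elements, gateways):
    """Give each element to the gateway for its protocol.

    Returns the elements for which no matching gateway exists.
    Assumes (for simplicity) at most one gateway per protocol.
    """
    by_protocol = {g.protocol: g for g in gateways}
    unmanaged = []
    for element in elements:
        gateway = by_protocol.get(element.protocol)
        if gateway is None:
            unmanaged.append(element)
        else:
            gateway.manage(element)
    return unmanaged
```

In such a sketch, an element whose protocol has no corresponding gateway is simply reported as unmanaged rather than being handled by a gateway of another protocol.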
Such gateway processes may, for example, receive unsolicited messages from their respective network elements and/or may poll their respective network elements for certain information. Prior art network management systems commonly recognize faults (or “traps”) generated within the network and/or utilize polling of the network elements to provide management. For example, IP and SNMP devices may generate fault messages (which may be referred to as traps), which are unsolicited messages that may be received by the management system. Examples of such trap messages include messages indicating that a network element's CPU utilization is too high, that a network element just rebooted, that available data storage capacity is low on a network element, or that an interface on a network element is down. Various other types of unsolicited trap messages may be generated by a network element and received by a network management system, as those of ordinary skill in the art will recognize. Such messages are generally generated in a defined protocol, such as SNMP, which the management system (e.g., a gateway process) can recognize in order to process the received messages. As further examples, such information can also be received through TL1, CMIP, or ASCII messages, such as log files for different network elements.
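As a rough sketch of how a gateway process might recognize unsolicited messages arriving in its defined protocol, consider the following. The `PROTOCOL|element|kind` text format, the `Trap` type, and the trap kind names are all invented for illustration; real SNMP traps are binary-encoded and are not parsed this way.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trap kinds, mirroring the examples given in the text.
KNOWN_TRAPS = {"cpu_high", "rebooted", "storage_low", "interface_down"}


@dataclass(frozen=True)
class Trap:
    protocol: str
    element: str
    kind: str


def parse_message(raw: str, expected_protocol: str) -> Optional[Trap]:
    """Parse 'PROTOCOL|element|kind' (an invented format, for illustration).

    Returns None when the message is malformed, uses a protocol other
    than the one this gateway handles, or names an unrecognized trap kind.
    """
    parts = raw.strip().split("|")
    if len(parts) != 3:
        return None
    protocol, element, kind = parts
    if protocol != expected_protocol or kind not in KNOWN_TRAPS:
        return None
    return Trap(protocol, element, kind)
```

For instance, under these assumptions an SNMP gateway would accept `"SNMP|router1|interface_down"` and would ignore a message tagged with a different protocol.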
Some network management systems may desire information regarding the performance of network elements that is not provided through unsolicited messages generated by such network elements. In such a case, gateways may be implemented to poll their respective network elements for particular information. For instance, a gateway may be implemented to poll its respective network element(s) to gather information about various operational characteristics of such network element(s). Gateways of prior art systems are typically implemented to periodically poll their respective network elements according to pre-set time intervals. For instance, a gateway may be pre-set to poll its respective network element(s) once every five minutes or once every twenty minutes, as examples. Gateways typically poll network element(s) to request values for various variables detailing information about the operation/performance of the network element(s). For example, a gateway may periodically poll a network element to determine whether the network element is operational and responding to the poll. If a network element fails to respond to such a poll, such failure to respond may be indicative of a problem with the network element, such as the network element having a hardware or software failure. As other examples, a gateway may periodically poll a network element to determine the workload being placed on such network element, the network element's available memory capacity, etcetera.
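The periodic polling described above might be sketched, under stated assumptions, as follows. The function names, the `query` callback, and the use of `TimeoutError` to signal a non-responding element are hypothetical choices made for this illustration; the interval is given in seconds.

```python
import time


def poll_once(elements, query):
    """Poll each element once.

    `query(name)` is an assumed callback that returns a dict of variable
    values, or raises TimeoutError when the element does not respond.
    A non-responding element is recorded as None.
    """
    results = {}
    for name in elements:
        try:
            results[name] = query(name)
        except TimeoutError:
            results[name] = None
    return results


def run_polling(elements, query, interval_s, cycles, sleep=time.sleep):
    """Poll all elements `cycles` times, waiting `interval_s` between polls."""
    history = []
    for i in range(cycles):
        history.append(poll_once(elements, query))
        if i < cycles - 1:  # no need to wait after the final poll
            sleep(interval_s)
    return history
```

A five-minute pre-set interval would correspond to `interval_s=300`. The `sleep` parameter is injectable only so that the loop can be exercised without actually waiting.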
Depending on the amount of intelligence implemented within such gateway process, it may evaluate the performance of its respective network elements (e.g., based on unsolicited messages and responses to polling) and may trigger certain actions as necessary to manage the network elements. For instance, upon a fault message being received for a particular network element, the gateway process may generate an alert to a network administrator to notify the network administrator of such fault condition. As a further example, once a gateway receives the variable values from the network element(s) in response to a poll, the gateway may then process such variable values to monitor the operation of the network element(s). For instance, if a gateway polls a network element for a response and fails to receive such a response, the gateway may provide an alert to the network administrator (e.g., by presenting an alert message to a computer workstation) notifying the network administrator of a problem with the network element. Similarly, if a gateway polls a network element for its available memory and determines that such network element has little or no memory available, the network administrator may be alerted as to such condition.
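To illustrate the kind of evaluation a gateway process might perform, the following minimal sketch turns a received fault message or a poll response into alerts for the network administrator. The function names, the `free_memory_bytes` variable name, the alert text, and the threshold value are all assumptions made for this example; a poll response of `None` is assumed to mean that the element did not answer.

```python
LOW_MEMORY_BYTES = 1_000_000  # assumed threshold, chosen for illustration


def alerts_for_fault(element, fault_kind):
    """A received fault message always produces an alert."""
    return [f"ALERT {element}: fault reported ({fault_kind})"]


def alerts_for_poll(element, response, low_memory_bytes=LOW_MEMORY_BYTES):
    """Evaluate one poll response.

    `response` is None when the element failed to respond, otherwise a
    dict of variable values returned by the element.
    """
    if response is None:
        return [f"ALERT {element}: no response to poll"]
    alerts = []
    free = response.get("free_memory_bytes")
    if free is not None and free < low_memory_bytes:
        alerts.append(f"ALERT {element}: low memory ({free} bytes free)")
    return alerts
```

In this sketch a healthy response yields an empty list, so the administrator is notified only when a fault, a missing response, or a low-memory condition is observed.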
Considering the great reliance that may be placed on such gateway processes in management systems for managing network elements, it becomes very desirable to efficiently detect and resolve failures of such gateway processes. For instance, if a gateway process fails for some reason, its respective network elements may go unmanaged. That is, when a gateway process fails, management of its respective network elements is interrupted. Such an interruption in the management of the network elements is typically undesirable to a network provider because, for example, an event may occur that affects the network elements during such interruption and the network provider would have no knowledge of such event.
Prior art implementations of network management systems often fail to efficiently detect failure of a gateway process. For example, a gateway process may fail without the management system or network administrator realizing such failure. For instance, if messages are not being received from a gateway process, the management system may assume that the gateway process is operational but simply has no messages to report to the management system (e.g., may assume that the gateway simply has nothing to report regarding its respective network elements). Thus, in some network management systems of the prior art, a gateway responsible for managing particular network elements may have failed long before the management system recognizes such gateway failure.
Also, prior art implementations of network management systems often fail to efficiently resolve the failure of a gateway process. For example, it may take an undesirably long time for another gateway process to be initiated for managing the network elements of the failed gateway process. Additionally, while a solution is being implemented to effectively recover management of the network elements of the failed gateway process, many messages (or events) regarding such network elements may be lost. That is, unsolicited messages (e.g., fault messages) are not being received from the network elements during the time required for recovering management, and polling of the network elements is also not being performed. Once a management recovery solution is implemented (e.g., once another gateway process is initiated for managing such network elements), management of the network elements may resume. However, because management was interrupted, events may have transpired during such interruption indicating severe performance problems with one or more of the network elements, of which the newly initiated gateway process is unaware. Typically, messages lost during such interruption are not recovered, and therefore the newly initiated gateway may not efficiently recognize such severe performance problems.