1. Field of the Invention
The present invention relates to a method and apparatus for managing and maintaining a data communication network. More particularly, the present invention relates to a method and apparatus for identifying the errors and failures created by service components within a distributed computer network, notifying system administrators of such errors and failures, and automatically restarting the failed components.
2. The Background
Data communication networking capabilities are typically provided to the personal user and the professional community by telephone companies (Telcos) or commercial Internet Service Providers (ISPs) who operate network access points along the information superhighway. Network access points, which are commonly referred to as Points of Presence or PoPs, are located within wide area networks (WANs) and serve to house the network interfaces and service components necessary to provide routing, bridging and other essential networking functions. It is through these network access points that the user is able to connect with public domains, such as the Internet, and private domains, such as the user's employer's intranet.
The ISPs and Telcos maintain control of the network interfaces and service components comprising the data communication network at locations commonly referred to as Network Operation Centers (NOCs). It is here, at the NOCs, where the ISPs and Telcos employ service administrators whose task is to maintain and manage a finite sector of the overall data communications network. Managing and maintaining the interfaces and services that encompass the network is complicated. The interfaces and services for which a system administrator has responsibility are not confined to the NOC, but rather are remotely dispersed throughout the PoPs. For example, the NOC may be located in San Jose, Calif. and the services and interfaces for which the system administrator has responsibility may be located at PoPs in San Francisco, Calif., Los Angeles, Calif. and Seattle, Wash. The remoteness of the interfaces and services makes it difficult for the system administrator to oversee the system from one fixed location, such as the NOC.
Anyone who has used computers in a network environment knows that problems related to the interfaces and services are the rule and not the exception. The vast majority of these problems are minor in nature and do not require the system administrator to take action. Networks have been configured in the past so that these minor errors are self-rectifying; either the interface or service is capable of correcting its own error or other interfaces or services are capable of performing a rescuing function. In other situations the problems that are encountered within the network are major and require the system administrator to take action; i.e., physically rerouting data traffic by changing interfaces and services.
It is the desire of the service providers to have a maintenance and management system for a data communication network that gives the system administrator the ability to accumulate quality and reliability data on all the interfaces and services in use. If a system administrator has real-time access to the performance history of each interface and service, the administrator can then predict future performance. For example, the system administrator can assess the performance history for a given service over a specified period of time. If the history shows that the service has performed below maximum capability, or that a trend in recent self-corrected errors has arisen, then the system administrator can make adjustments accordingly. These adjustments may be, for example, choosing to shut down that particular service or limiting the volume of data traffic encountered by that service. Having the capability to assess prior performance history and make adjustments accordingly allows the service provider to be pro-active and to prevent future major failures from occurring.
While the service providers want access to information pertaining to any and all errors occurring within the distributed communication network, they also desire that the maintenance and management system be as self-rectifying as possible. Not only should minor errors be self-corrected, but major failures should be self-corrected as well. This includes using necessary watchdog mechanisms that cause failed components and services to be restarted. Additionally, the watchdog itself must be self-rectifying as an added measure of overall reliability insurance. In this manner the service provider is able to maintain and manage the data communication network without the need for having more personnel than necessary to monitor and manipulate the network on an ongoing basis.
A method and apparatus for providing management and maintenance to a node within a data communications network and a method and apparatus for providing management and maintenance to the composite data communications network. A master daemon located at the node is activated. The master daemon starts a control adapter running on the node and if the control adapter stops then the master daemon restarts the control adapter. The control adapter is capable of starting and stopping all services running on the node. Signals are communicated between the node and the services by way of adapters. Signaling provides for the exchange of useful event data related to the nodes and services comprising the data communication network.
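The supervision relationship described above, in which a master daemon starts a control adapter and restarts it if it stops, while the control adapter starts and stops the services on the node, might be sketched as follows. This is a minimal in-process illustration only. The class names, the `publish` callback and the `running` flags are hypothetical stand-ins for real processes and signaling, and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Service:
    """A service on the node; `running` stands in for real process state."""
    name: str
    running: bool = False


@dataclass
class ControlAdapter:
    """Starts and stops services on the node and signals each event."""
    services: Dict[str, Service]
    publish: Callable[[str, str], None]  # hypothetical signal sink: (event, subject)
    running: bool = False

    def start(self) -> None:
        self.running = True
        self.publish("adapter_started", "control_adapter")

    def start_service(self, name: str) -> None:
        self.services[name].running = True
        self.publish("service_started", name)

    def stop_service(self, name: str) -> None:
        self.services[name].running = False
        self.publish("service_stopped", name)


@dataclass
class MasterDaemon:
    """Watchdog that starts the control adapter and restarts it if it stops."""
    adapter: ControlAdapter
    restarts: int = 0

    def activate(self) -> None:
        self.adapter.start()

    def check(self) -> bool:
        """Run one watchdog pass; return True if a restart was needed."""
        if self.adapter.running:
            return False
        self.adapter.start()
        self.restarts += 1
        return True


if __name__ == "__main__":
    adapter = ControlAdapter({"dns": Service("dns")}, lambda e, s: print(e, s))
    daemon = MasterDaemon(adapter)
    daemon.activate()
    adapter.start_service("dns")
    adapter.running = False  # simulate the control adapter stopping
    daemon.check()           # the master daemon restarts it
```

In a deployed system the `check` pass would presumably run on a timer or react to a process-exit notification; the sketch leaves that scheduling out.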
In another aspect of the invention, a network management application is started on a host located at a network operation center. The network management application is in communication with network nodes and services through an adapter. Signals are communicated between the management application, the node and the services by way of adapters. Signaling provides for the exchange of useful event data related to the nodes and services running on the nodes.
In another aspect of the invention, the network management application has an association with a database of information. The network management application is in communication with an information bus through an adapter so as to update the contents of the database from events signaled by adapters located at the node of the data communications network.
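One way to picture the database update path is a small publish/subscribe sketch in which an adapter for the network management application listens on the information bus and records each signaled event. Everything named here is an assumption made for illustration: the `InformationBus` and `ManagementAdapter` classes, the `events` topic and table, and the use of an in-memory SQLite database in place of whatever database the application actually uses.

```python
import sqlite3
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]  # hypothetical event shape: node, service, kind


class InformationBus:
    """In-process stand-in for the information bus."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every handler registered for the topic.
        for handler in self._subscribers[topic]:
            handler(event)


class ManagementAdapter:
    """Connects the management application to the bus and stores events."""

    def __init__(self, bus: InformationBus, db: sqlite3.Connection) -> None:
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events (node TEXT, service TEXT, kind TEXT)"
        )
        bus.subscribe("events", self.record)

    def record(self, event: Event) -> None:
        # Update the database from an event signaled by a node adapter.
        self.db.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (event["node"], event["service"], event["kind"]),
        )
        self.db.commit()

    def count(self, service: str, kind: str) -> int:
        row = self.db.execute(
            "SELECT COUNT(*) FROM events WHERE service = ? AND kind = ?",
            (service, kind),
        ).fetchone()
        return row[0]


if __name__ == "__main__":
    bus = InformationBus()
    app = ManagementAdapter(bus, sqlite3.connect(":memory:"))
    bus.publish("events", {"node": "pop-sf", "service": "dns", "kind": "service_stopped"})
    print(app.count("dns", "service_stopped"))
```

Keeping the event history in a queryable store is what would let an administrator review the performance of a given service over a period of time, as the background section describes.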