1. Field of the Invention
The present invention relates to a method and apparatus for managing and maintaining a data communication network. More particularly, the present invention relates to a method and apparatus for identifying the errors and failures created by service components within a distributed computer network, notifying system administrators of such errors and failures, and automatically restarting the failed components.
2. The Background
Data communication networking capabilities are typically provided to the personal user and the professional community by telephone companies (Telcos) or commercial Internet Service Providers (ISPs) who operate network access points along the information superhighway. Network access points, which are commonly referred to as Points of Presence or PoPs, are located within wide area networks (WANs) and serve to house the network interfaces and service components necessary to provide routing, bridging and other essential networking functions. It is through these network access points that the user is able to connect with public domains, such as the Internet, and private domains, such as the intranet of the user's employer.
The ISPs and Telcos maintain control of the network interfaces and service components comprising the data communication network at locations commonly referred to as Network Operation Centers (NOCs). It is here, at the NOCs, that the ISPs and Telcos employ system administrators whose task is to maintain and manage a finite sector of the overall data communication network. Managing and maintaining the interfaces and services that make up the network is complicated. The interfaces and services for which a system administrator has responsibility are not confined to the NOC, but rather are remotely dispersed throughout the PoPs. For example, the NOC may be located in San Jose, Calif. and the services and interfaces for which the system administrator has responsibility may be located at PoPs in San Francisco, Calif., Los Angeles, Calif. and Seattle, Wash. The remoteness of the interfaces and services makes it difficult for the system administrator to oversee the system from one fixed location, such as the NOC.
It is common knowledge to anyone who has used computers in a network environment that problems related to the interfaces and services are the rule and not the exception. The vast majority of these problems are minor in nature and do not require the system administrator to take action. Networks have been configured in the past so that these minor errors are self-rectifying; either the interface or service is capable of correcting its own error, or other interfaces or services are capable of performing a rescuing function. In other situations the problems that are encountered within the network are major and require the system administrator to take action; i.e., physically rerouting data traffic by changing interfaces and services.
It is the desire of the service providers to have a maintenance and management system for a data communication network that gives the system administrator the ability to accumulate quality and reliability data on all the interfaces and services in use. If a system administrator has real-time access to the performance history of each interface and service, the administrator can then predict future performance. For example, the system administrator can assess the performance history for a given service over a specified period of time. If the history shows that the service has performed below maximum capability, or that a trend of recent self-corrected errors has arisen, then the system administrator can make adjustments accordingly. These adjustments may be, for example, choosing to shut down that particular service or limiting the volume of data traffic encountered by that service. Having the capability to assess prior performance history and make adjustments accordingly allows the service provider to be pro-active and to limit the occurrence of future major failures.
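By way of illustration only, the assessment of a performance history described above might be sketched as follows. The sketch counts self-corrected errors for one service in consecutive time windows and flags a rising trend. The names `ErrorEvent`, `errors_per_window` and `is_trending_up`, the window length and the `min_increase` threshold are all hypothetical and are not taken from any particular management system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorEvent:
    """Hypothetical record of one error reported by a service."""
    service: str          # name of the reporting service (illustrative)
    timestamp: float      # seconds since an arbitrary reference time
    self_corrected: bool  # True if the service recovered without intervention


def errors_per_window(events: List[ErrorEvent], service: str,
                      start: float, end: float, window: float) -> List[int]:
    """Count self-corrected errors for one service in consecutive windows."""
    n_windows = int((end - start) // window)
    counts = [0] * n_windows
    for ev in events:
        if ev.service != service or not ev.self_corrected:
            continue
        # Ignore events outside the whole windows that fit in [start, end).
        if start <= ev.timestamp < start + n_windows * window:
            counts[int((ev.timestamp - start) // window)] += 1
    return counts


def is_trending_up(counts: List[int], min_increase: float = 2) -> bool:
    """Flag a trend when the latest window exceeds the average of the
    earlier windows by at least min_increase (an assumed threshold)."""
    if len(counts) < 2:
        return False
    baseline = sum(counts[:-1]) / (len(counts) - 1)
    return counts[-1] - baseline >= min_increase
```

A system administrator presented with such a flag could then choose one of the adjustments mentioned above, such as limiting the traffic directed to the affected service.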
While the service providers want access to information pertaining to any and all errors occurring within the distributed communication network, they also desire that the maintenance and management system be as self-rectifying as possible. Not only should minor errors be self-corrected, but major failures should be self-corrected as well. This includes using the necessary watchdog mechanisms that cause failed components and services to be restarted. Additionally, the watchdog itself must be self-rectifying as an added measure of overall reliability insurance. In this manner the service provider is able to maintain and manage the data communication network without needing more personnel than necessary to monitor and manipulate the network on an ongoing basis.
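By way of illustration only, a watchdog of the kind desired above might be sketched as follows. The sketch polls a set of services, notifies an administrator of each failure, restarts the failed service, and is itself watched by a second instance so that a failed watchdog is also restarted. The classes `Service` and `Watchdog`, the method names and the `notify` callback are hypothetical; a real system would replace the in-memory flags with actual health checks and process restarts.

```python
from typing import Callable, List


class Service:
    """Hypothetical stand-in for a network service component."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True
        self.restarts = 0

    def is_alive(self) -> bool:
        return self.running

    def restart(self) -> None:
        self.running = True
        self.restarts += 1


class Watchdog(Service):
    """Restarts failed services; being a Service, it can itself be watched."""

    def __init__(self, name: str, services: List[Service],
                 notify: Callable[[str], None]) -> None:
        super().__init__(name)
        self.services = services
        self.notify = notify  # assumed channel to the system administrator

    def poll_once(self) -> List[str]:
        """Check each service once; return the names of those restarted."""
        if not self.running:
            return []  # a failed watchdog cannot rescue anything
        restarted = []
        for svc in self.services:
            if not svc.is_alive():
                self.notify(f"{svc.name} failed; restarting")
                svc.restart()
                restarted.append(svc.name)
        return restarted
```

In this sketch a second `Watchdog` instance constructed with the first watchdog in its service list provides the self-rectifying behavior: if the first watchdog fails, the second notifies the administrator and restarts it on its next poll.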