1. Field of the Invention
The present invention relates generally to a method and apparatus for detecting a problem with a network device employed in a communication system, and particularly to a method and apparatus for detecting a hardware or software-related problem within one or more network devices among a large number of operational network devices within the communication system.
2. Description of the Prior Art
In modern communication systems, a large number of network devices, such as mail servers, routers and computers, may be present within a single system. Under such circumstances, equipment failure is common and requires diagnostic evaluation and debugging. If the system includes hundreds of routers, as in Cisco Systems Inc. laboratories, then to identify the router that has failed, an engineer located at a technical support center must establish at least limited communication with every one of the routers (referred to as logging into the routers) in order to narrow the problem to one or more specific routers before diagnosing it. This is commonly a time-consuming and rigorous process. In fact, currently, among tens or hundreds of routers in operation, it is not unusual for engineers to spend one month detecting a problem with a specific router.
Currently, when a component within a router fails, the router generates error messages for notification of the failure.
There are several ways in which a network communication system may fail. Among these are problems arising in the hardware and software components of various devices, and in the communication lines and interfaces connecting the various devices of the communication system together. When there is a hardware problem, such as the failure of a board in one of the devices due to overheating, the driver in the device detects the problem by receiving an error message from the board, thereby alerting the software being executed in the device to the failure. However, when the system fails, valuable information regarding the reason for failure, which may be embedded in an error message in the software, may be lost. The loss of such clues makes the task of diagnosing the cause of failure more difficult and time-consuming for an engineer.
Relevant information regarding the failure of a device exists by virtue of the software executing in the device, but it is not necessarily communicated to the technical support staff after the device has failed. When the device, which might be a computer or an access server (router), is powered down and then powered back on, the original problem may disappear during rebooting, or the conditions that caused the problem may no longer exist. Such is the case when a board malfunctions due to overheating and resumes functioning properly once it has cooled. Similarly, an existing problem may not recur immediately after the device is rebooted and may resurface at a later time, making the task of troubleshooting (or debugging) more difficult.
Before the device fails, the operating system or other software being executed in the device has the most current information regarding the status of the various components in the device. Currently, such information is not communicated to the technical support center and remains isolated within the device. If the status of the device immediately before its failure were available, the engineers located at a technical support center could draw valuable insights into the mechanisms of failure and suggest ways of remedying the problem.
If the device is a computer, the operating system or the software within the computer has current information regarding the status of the modem, software updates, the hard drive and every other hardware and software subcomponent within the computer. If such information were available to the technical support center, troubleshooting the device could be performed much more efficiently and cost-effectively. In addition, since the time during which the system is out of service is shortened, the customers making use of the system experience less delay, resulting in a higher degree of customer satisfaction.
Therefore, it is desirable to devise a system and method for monitoring the status of a network device at all times and for reporting any problems that may arise in the hardware, software or interface components of the device to a technical support center, so as to rapidly detect a problem with one or more network devices within a large group of network devices. Additionally, the need arises for the monitoring system and method to include the capability to process instructions from the technical support center in order to execute diagnostic tests on the hardware components or to request more detailed information from the software subsystems included within the device.
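By way of illustration only, the desired behavior described above might be sketched as follows. This is a minimal sketch under stated assumptions and is not a description of any particular embodiment: the names DeviceMonitor, StatusReport, poll, handle_command, the "diagnose" instruction and the device identifier "router-17" are all hypothetical, and the in-memory list named outbox merely stands in for whatever communication link carries reports to the technical support center.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StatusReport:
    """One report sent from a device to the support center (hypothetical format)."""
    device_id: str
    component: str
    status: str   # "ok", "fault", or "unknown"
    detail: str


# A probe returns (healthy, detail) for one hardware or software component.
Probe = Callable[[], Tuple[bool, str]]


@dataclass
class DeviceMonitor:
    """Hypothetical on-device monitor: polls components and reports faults."""
    device_id: str
    send: Callable[[StatusReport], None]           # assumed transport to the support center
    probes: Dict[str, Probe] = field(default_factory=dict)

    def poll(self) -> List[StatusReport]:
        """Check every component and report each fault while the device is still running."""
        reports = []
        for name, probe in self.probes.items():
            healthy, detail = probe()
            if not healthy:
                report = StatusReport(self.device_id, name, "fault", detail)
                self.send(report)
                reports.append(report)
        return reports

    def handle_command(self, command: str, component: str) -> StatusReport:
        """Process an instruction from the support center (only "diagnose" is assumed here)."""
        if command != "diagnose" or component not in self.probes:
            return StatusReport(self.device_id, component, "unknown", "unsupported request")
        healthy, detail = self.probes[component]()
        return StatusReport(self.device_id, component, "ok" if healthy else "fault", detail)


# Example usage with stand-in probes; a list plays the role of the support center.
outbox: List[StatusReport] = []
monitor = DeviceMonitor(
    device_id="router-17",
    send=outbox.append,
    probes={
        "line_card": lambda: (False, "temperature above threshold"),
        "modem": lambda: (True, "nominal"),
    },
)
monitor.poll()
```

In this sketch, a fault is reported at the moment it is observed, so the information survives a later reboot of the device, and the support center can request a diagnostic test on a named component without logging into every device in turn.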