1. Field of the Invention
The present invention relates to processor-based systems in general and, in particular, to processor-based systems whose services are required at all times. Such systems are referred to in this document as High Availability Systems.
2. Prior Art
Today's society is so dependent on technology that any glitch or failure in the associated machines or systems could result in catastrophic consequences.
Communications is one of the technological areas of utmost importance. Communications include interconnecting networks, such as the internet, and the machines and/or subsystems that are connected thereto. The communications area provides the mechanisms for users to communicate via desktop systems such as computers, word processors, etc.
In addition to the desktop systems, there are infrastructure systems that are required to facilitate the interconnection and/or provide shared services. These systems are processor based and may include devices such as web servers, printers, networking routers, switches, PBXs, phone systems, etc. Such systems are placed at high availability locations and must run continuously or, if a hardware failure occurs, recover from it automatically.
Even though it would be desirable to have fail-safe systems, this is not possible in the real world. Machines do break down, and when one does, the next best thing is to be able to troubleshoot the failed system, detect the cause of the failure, and make sure the same problem does not affect the system repeatedly.
To meet the performance goals of these processor-based systems, manufacturers provide them with a bundle of RAS (Reliability, Availability and Serviceability) functions. The RAS functions include a hardware reset feature that handles hardware failures without intervention from an operator.
To provide the hardware reset RAS function, a single watchdog timer circuit is placed within the processor-based system. Its sole function is to restart the system if it detects a hang or failure. Most watchdog timer systems work in a mode where the microprocessor must periodically service the watchdog timer through some sort of read or write operation. This operation restarts a countdown timer contained within the watchdog timer circuit. If this timer reaches zero (or some terminal value, if it counts up), it generates a signal that resets the processor and all of its support components. This occurs only if the microprocessor has not serviced the watchdog timer within a set time interval. In High Availability microprocessor-based subsystems such as networking hardware, servers and the like, especially when the unit is not in continual human contact, this restart is necessary to bring the box back online. Relying on user intervention to restart a hung subsystem is undesirable, since the system may be inaccessible or it may take excessive time for human intervention to occur.
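The countdown behavior of the single watchdog timer described above can be illustrated with a minimal software model. This is a sketch only: the class name `WatchdogTimer`, the helper `run`, the tick-based timing and the timeout values are assumptions introduced for illustration and do not describe any particular circuit.

```python
class WatchdogTimer:
    """Hypothetical model of a single countdown watchdog timer."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks  # assumed reload value
        self.count = timeout_ticks
        self.reset_asserted = False

    def service(self):
        """Periodic read/write by the processor: restart the countdown."""
        self.count = self.timeout_ticks

    def tick(self):
        """Advance one clock tick; assert reset when the count reaches zero."""
        if self.reset_asserted:
            return
        self.count -= 1
        if self.count == 0:
            self.reset_asserted = True


def run(timeout_ticks, service_ticks, total_ticks):
    """Return the tick at which reset fires, or None if it never fires.

    service_ticks is the set of ticks at which the processor services
    the timer; a hung processor simply stops appearing in this set.
    """
    wdt = WatchdogTimer(timeout_ticks)
    for t in range(1, total_ticks + 1):
        if t in service_ticks:
            wdt.service()
        wdt.tick()
        if wdt.reset_asserted:
            return t
    return None


# A healthy processor services the timer every third tick: no reset.
healthy = run(5, set(range(1, 21, 3)), 20)
# A processor that hangs after tick 7 stops servicing the timer.
hung = run(5, {1, 4, 7}, 20)
```

In this model the reset fires a fixed number of ticks after the last service operation, which is the only information the single timer conveys about the hang.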
Even though the single watchdog timer works well for its intended purpose, it is plagued with several problems. First, when a microprocessor-based system is rebooted on a watchdog timeout, the reboot is very destructive to the contents of memory, the registers and the microprocessor stack, preventing the software from logging what may have gone wrong or why the system had to be rebooted.
Another problem with the single watchdog timer is that it is effective only with hardware-associated failures. Quite often the failures that cause microprocessor-based systems to hang are software related. For example, a software bug or a runaway pointer in memory could cause the system to fetch a wrong instruction and then lock up. Even a hardware error, such as a bad memory location, can cause a system hang. For these types of errors the single watchdog timer circuit is ineffective.
The services of a trained technician are required to troubleshoot and identify the problems that cause the lock up. A lot of expensive equipment and technician time are also required to identify and correct the problem. Even with well-trained technicians and sophisticated instruments, sometimes the condition that causes a hung system cannot be replicated. The only recourse is then to discard the unit as defective. The cost associated with abandoning the unit, or with troubleshooting to identify the cause of the error, can be prohibitively high and unacceptable.
In view of the above there is a need for a RAS (Reliability, Availability and Serviceability) system that solves the problems of the prior art single watchdog timer system. The present invention (set forth herein) provides such a system.
The RAS system of the present invention includes cascaded watchdog timer circuits. The first watchdog timer circuit trips on microprocessor inactivity and generates a non-maskable interrupt (termed a Soft Reset or system management interrupt) to the processor. The interrupt wakes the processor enough for it to recover from the hung state and log the current status of the system. The processor logs the contents of memory for later analysis, copies the register stack to help programmers determine where the system hung, and also finds which interfaces or devices may have been involved in the hang by querying registers. All of this will improve the software and hardware designer's ability to find bugs and the user's ability to detect configuration errors. Once this logging is complete, the system allows the second watchdog timer to expire, which generates a hard reset signal that is used to reset the hardware. If, due to a very complex hang, the processor does not wake after the first watchdog timer triggers the non-maskable interrupt or system management interrupt, the second watchdog timer will trigger anyway on inactivity and reset the box with a hard reset.
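The two-stage sequence described above can be sketched as a small software model. This is an illustrative sketch under stated assumptions, not the circuit of the invention: the names `CascadedWatchdog`, `capture_status` and `simulate_hang`, the tick-based timing, the rule that the second stage starts counting when the first stage expires, and the contents of the status log are all hypothetical.

```python
class CascadedWatchdog:
    """Hypothetical model of two watchdog timers activated in sequence."""

    def __init__(self, first_timeout, second_timeout):
        self.first_timeout = first_timeout
        self.second_timeout = second_timeout
        self.first_count = first_timeout
        self.second_count = None      # assumed idle until stage one expires
        self.nmi_asserted = False     # soft reset / non-maskable interrupt
        self.hard_reset = False

    def service(self):
        """Normal processor activity restarts stage one only.

        Assumption: once stage one has tripped, servicing is ignored so
        that stage two is always allowed to expire.
        """
        if not self.nmi_asserted:
            self.first_count = self.first_timeout

    def tick(self):
        """Advance one clock tick through whichever stage is active."""
        if self.hard_reset:
            return
        if not self.nmi_asserted:
            self.first_count -= 1
            if self.first_count == 0:
                self.nmi_asserted = True
                self.second_count = self.second_timeout
        else:
            self.second_count -= 1
            if self.second_count == 0:
                self.hard_reset = True


def capture_status(system_state):
    """Hypothetical interrupt handler: copy the state to be logged."""
    return {
        "memory": list(system_state["memory"]),
        "register_stack": list(system_state["register_stack"]),
        "device_registers": dict(system_state["device_registers"]),
    }


def simulate_hang(first_timeout, second_timeout, processor_wakes, system_state):
    """Simulate a hung processor; return (events, log)."""
    wd = CascadedWatchdog(first_timeout, second_timeout)
    events, log, tick = [], None, 0
    while not wd.hard_reset:
        tick += 1
        nmi_before = wd.nmi_asserted
        wd.tick()
        if wd.nmi_asserted and not nmi_before:
            events.append((tick, "NMI"))
            if processor_wakes:
                log = capture_status(system_state)
                events.append((tick, "status logged"))
    events.append((tick, "hard reset"))
    return events, log


# Made-up state used only to exercise the model.
state = {
    "memory": [0xDE, 0xAD],
    "register_stack": [0x1000, 0x1004],
    "device_registers": {"eth0_status": 0x3},
}
woke_events, woke_log = simulate_hang(3, 4, True, state)
dead_events, dead_log = simulate_hang(3, 4, False, state)
```

In both runs the hard reset arrives at the same tick; the only difference is whether a status log exists when it does, which is the point of cascading the two timers.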
Hardware circuits are provided to ensure that the cascaded watchdog timer circuits are activated sequentially, are reset on a reset condition, and inhibit race conditions when the reset is generated.
Software is also provided to handle the intermediate system query step, which interrogates and logs the status of the hung system.