1. Field of the Invention
The present invention generally relates to watchdog timers for personal computer systems. More specifically, the preferred embodiment relates to the use of a software watchdog timer to monitor the uptime of individual applications running on a computer system.
2. Background of the Invention
Watchdog circuits are rather common in modern computer systems. A watchdog circuit is one way of creating a stable computing platform. In fact, when one speaks of a stable, robust computer system, the watchdog circuit is indirectly one of the reasons that the system has these attributes. Computer designers rely on the watchdog circuit to reset the system in the event that something goes wrong. If a computer system hangs or locks up, the watchdog circuit can perform a number of tasks, including logging error information, checking memory, and rebooting the system so the computer will be up and running again in a short amount of time.
A watchdog circuit typically is a timing circuit that measures a certain system activity or activities. If the system activity does not occur within a prescribed time period, the watchdog circuit generates an output signal indicating that the activity has not occurred. In its simplest form, the watchdog timer ensures that the system is operational. Modern watchdog circuits are capable of performing a variety of tasks, but the heart of a watchdog timer is essentially just a counter. The timer continually counts up or down using the system clock towards a predetermined value until one of two things happens. First, the counter can be cleared so that the amount of time required to count to the predetermined value is pushed back to the maximum value. For example, if a timer counts from a maximum value of 300 seconds towards a minimum value of zero seconds, then when the timer is cleared, the timer will revert to the maximum value and continue counting down from 300 seconds. The clear command (sometimes referred to as “hitting the watchdog”) is typically issued by the operating system (OS). Programmers will insert commands in the OS code instructing the OS to periodically hit the watchdog. Thus, as long as the OS is operating as intended, the watchdog timer will be cleared periodically and the timer never reaches the predetermined value.
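By way of illustration only, the countdown behavior described above may be sketched in software as follows. The class name `WatchdogTimer` and its methods `hit` and `tick` are hypothetical names chosen for this sketch. Time is advanced by explicit calls rather than by a system clock, so the sketch models the counter logic only and is not a representation of any particular watchdog circuit.

```python
class WatchdogTimer:
    """Hypothetical countdown watchdog: counts from max_seconds toward zero."""

    def __init__(self, max_seconds=300):
        self.max_seconds = max_seconds
        self.remaining = max_seconds
        self.expired = False

    def hit(self):
        """Clear the timer ("hit the watchdog"): revert to the maximum value."""
        if not self.expired:
            self.remaining = self.max_seconds

    def tick(self, seconds=1):
        """Advance the counter; return True once the predetermined value (zero) is reached."""
        if self.expired:
            return True
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.expired = True  # in the systems described, a reset would be issued here
        return self.expired


# Example: the OS hits the watchdog one second before expiry, then stops.
wd = WatchdogTimer(max_seconds=300)
wd.tick(299)   # counter is at 1 second, not yet expired
wd.hit()       # counter reverts to 300 seconds
wd.tick(300)   # no further hits: the counter reaches zero
```

As long as `hit` is called before the counter reaches zero, `tick` continues to return `False`, which corresponds to the case in which the OS is operating as intended.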
The second thing that may happen as the watchdog timer is running is that the counter actually does reach the predetermined value. This obviously occurs if the watchdog is never hit and the timer is never cleared. In this case, the watchdog timer will issue a reset command to the system and the computer will reboot. This type of automatic recovery is particularly helpful in unmanned computer systems. Obviously, if a user is working at a computer system and the OS becomes unresponsive, the user can initiate the reset procedure themselves. If, on the other hand, the computer is generally unmanned and working as a server in a computer network, it may not be readily obvious that the computer has ceased normal operations. The first person affected by such a condition will likely be a network user who discovers that they can't access a network database or perhaps their email. Thus, if a server becomes inoperative, the watchdog timer guarantees that the system will be up and running again in a short amount of time.
In their present configuration, conventional watchdog timers are certainly useful for their intended purpose. However, there are a number of drawbacks that can be improved upon by a more modern approach. From the perspective of server customers, the health of the OS is not necessarily the most important aspect of a network server. More often than not, a server actually exists to run a specific application and the proper operation of that application is the most important goal for the customer. Thus, if the key application or applications cease operation, but the OS effectively continues, the system will never reset and the customer experiences unwanted downtime.
Another problem with conventional systems is that the fix for a system lock-up is a full system reset or reboot operation. A more efficient solution to this problem is to first restart the failed application. The time required to end an application process and subsequently restart that application is much less than the time required to reset the entire system. If the application is successfully restarted, the end result is a decrease in downtime. If, however, the OS is unresponsive as well, the conventional watchdog timer will still recover the application by forcing a system reset. In either case, the minimum required downtime is achieved.
It is desirable, therefore, to develop an application-level watchdog timer that is capable of monitoring key applications and reviving those applications in the event the applications become unresponsive. The application-level watchdog timer may work in conjunction with a system-level watchdog timer to provide a staggered level of protection that may advantageously improve computer server uptime.
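By way of illustration only, the staggered protection described above may be sketched as a simple simulation. The function name `simulate`, its parameters, the timeout values, and the action labels are all assumptions made for this sketch and are not taken from the description above. The sketch further assumes that the application-level timer is software that runs only while the OS is responsive, and that the system-level timer is hit by the OS.

```python
def simulate(app_alive, os_alive, app_timeout=3, sys_timeout=5):
    """Return (tick, action) pairs for a hypothetical two-level watchdog.

    app_alive / os_alive: one boolean per tick, True if the application
    or the OS, respectively, is responsive during that tick.
    """
    events = []
    app_left, sys_left = app_timeout, sys_timeout
    for tick, (app_ok, os_ok) in enumerate(zip(app_alive, os_alive), start=1):
        if os_ok:
            sys_left = sys_timeout            # OS hits the system-level watchdog
            if app_ok:
                app_left = app_timeout        # application hits its own watchdog
            else:
                app_left -= 1
                if app_left == 0:             # application-level timer expired
                    events.append((tick, "restart_application"))
                    app_left = app_timeout
        else:
            sys_left -= 1                     # nothing hits either watchdog
            if sys_left == 0:                 # system-level timer expired
                events.append((tick, "system_reset"))
                app_left, sys_left = app_timeout, sys_timeout
    return events


# Application hangs while the OS keeps running: only the application is restarted.
app_only = simulate([True, False, False, False, True], [True] * 5)

# OS hangs as well: the system-level timer forces a full reset.
os_hang = simulate([False] * 6, [True] + [False] * 5)
```

In this sketch the application-level timeout is shorter than the system-level timeout, so an unresponsive application is restarted without a reboot whenever the OS is still running, and a full reset occurs only when the OS itself stops hitting the system-level timer.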