A watchdog timer is a standard mechanism for detecting and responding to system failures in a timely manner. Unless it is reset by software, the watchdog timer reacts to a failure by executing a specific routine when a set time period expires. A watchdog timer is typically used to handle situations in which a computer locks up, halts, hangs, or enters an infinite loop. The timer is started or initialized with a time out value, which is then decremented at a certain frequency by a decrement value. During normal operation, software resets the watchdog timer, typically at regular intervals, before it reaches zero; each reset reloads the time out value. If the timer is not reset before reaching zero, the watchdog timer is triggered, indicating that an error has occurred. Generally, a recovery procedure is then initiated to recover from the error.
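The countdown, reset, and trigger behavior described above can be illustrated with a minimal software sketch. It is a toy model rather than an actual watchdog implementation; the class name `WatchdogTimer`, its methods, and the helper `run` are hypothetical names introduced only for illustration.

```python
class WatchdogTimer:
    """Toy countdown watchdog; all names here are hypothetical."""

    def __init__(self, timeout, decrement=1):
        self.timeout = timeout        # time out value loaded on start and on reset
        self.decrement = decrement    # amount subtracted on every tick
        self.remaining = timeout
        self.triggered = False        # latched once the count reaches zero

    def tick(self):
        """Called once per timer period: count down toward zero."""
        if self.triggered:
            return
        self.remaining = max(0, self.remaining - self.decrement)
        if self.remaining == 0:
            # A real system would start its recovery procedure here
            # (interrupt, warm boot, or shutdown).
            self.triggered = True

    def reset(self):
        """Called by healthy software: reload the time out value."""
        self.remaining = self.timeout


def run(ticks, reset_every):
    """Simulate `ticks` timer periods; reset every `reset_every` ticks (0 = never)."""
    wd = WatchdogTimer(timeout=5)
    for t in range(1, ticks + 1):
        if reset_every and t % reset_every == 0:
            wd.reset()
        wd.tick()
    return wd.triggered
```

In this sketch, `run(20, 3)` models software that resets the timer every three ticks, so the count never reaches zero, whereas `run(20, 0)` models hung software that never resets the timer, so the watchdog is triggered after five ticks.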
Watchdog timers are typically employed in computer systems to detect errors and/or assist in recovering from errors. For example, a watchdog timer can be employed to detect and recover from application errors. If an application locks, halts, or is otherwise non-responsive, the timer will not be reset, which will consequently cause a previously set period of time to expire and a response to be triggered. Typical examples of watchdog-initiated responses include an interrupt, a warm boot (system reset), or a system shutdown.
Watchdog timers utilized in computer systems are often implemented as a separate retriggerable hardware timer attached to a processor's reset line. These hardware based watchdog timers are typically integrated into computer systems on Peripheral Component Interconnect (PCI) or Industry Standard Architecture (ISA) cards connected through their respective slots on the motherboard, or alternatively, made to operate outside the computer by way of an external serial device. These timers are connected to a computer's reset line and initiate a reboot when the timer is triggered (i.e., counts down to zero). Although this approach enables recovery from system lock-ups, it can result in a complete system reset, which typically involves a significant delay while the system reboots. Generally, these hardware based watchdog timers are unable to interact with an operating system, which limits their usefulness in computer systems.
Another approach to employing watchdog timers in computer systems is to utilize conventional system timers to implement watchdog timers. However, these system timers require a relatively large amount of time to program and operate, and they utilize significant system resources. Additionally, these system timers generally are accessible only via I/O registers, which is an inefficient way to access and program the timers. Also, these system timers are often already utilized for existing applications and are thus unavailable to software components such as the operating system or applications.
Another shortcoming of conventional watchdog timers is that they have limited time out values. For example, 32-bit watchdog timers operating at typical system bus speeds are limited to time out values of about seven minutes. Such a limitation renders these watchdog timers unusable for applications requiring longer time out values, such as booting a series of large servers, many of which require an hour or more to boot.
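The time out ceiling follows from simple arithmetic: a counter of a given width can hold only so many counts, and dividing that count by the decrement frequency gives the longest possible time out. The sketch below works this through. The text does not state the decrement frequency, so the 10 MHz rate used here is purely an assumption, chosen because it yields a figure close to the seven minutes cited; the function name `max_timeout_seconds` is likewise hypothetical.

```python
def max_timeout_seconds(counter_bits, tick_hz):
    """Longest time out of a down-counter that loses one count per tick."""
    return (2 ** counter_bits) / tick_hz


# Assumption: the counter is decremented at 10 MHz (not stated in the text).
ASSUMED_TICK_HZ = 10_000_000

seconds = max_timeout_seconds(32, ASSUMED_TICK_HZ)
minutes = seconds / 60
print(f"32-bit counter at 10 MHz: {seconds:.1f} s (about {minutes:.1f} min)")
```

Under this assumption the ceiling is roughly 429 seconds, a little over seven minutes, which is far short of the hour or more that booting a large server can require.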
Thus, watchdog timers can be effective in detecting and recovering from errors encountered in computer systems. However, conventional watchdog timers are either too expensive or require significant system resources. Specifically, hardware based watchdog timers are costly and can be limited in their recovery procedures, while standard system timers require too many system resources.