Due to a program bug or a transient malfunction, execution of a computer program may get lost or get stuck in a loop. Once this condition is discovered, it is often possible to recover some useful results by a controlled shutdown and restart of the computer. The controlled shutdown and restart is often referred to as a “panic,” and it typically involves saving diagnostic information, closing files during a shutdown of the operating system, and then resetting or rebooting the computer.
Typically a sub-system called a watchdog is used to discover when execution of a computer program gets lost or stuck in a loop. The watchdog may include a hardware interrupt timer that is normally reset periodically by proper execution of the computer program. Upon a failure to reset the interrupt timer within a certain grace period, the interrupt timer activates a non-maskable interrupt (NMI) to the computer. In response to the NMI, the computer executes a non-maskable interrupt routine that performs the controlled shutdown and restart of the computer.
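The watchdog behavior just described can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class `WatchdogTimer`, the helpers `make_panic_log` and `run`, the three-tick grace period, and the use of integer ticks in place of a hardware timer are all hypothetical and are not taken from any particular system.

```python
class WatchdogTimer:
    """Simulated interrupt timer (hypothetical model, not a real driver API)."""

    def __init__(self, grace_period, nmi_handler):
        self.grace_period = grace_period  # ticks allowed between resets
        self.nmi_handler = nmi_handler    # stands in for the NMI routine
        self.last_reset = 0
        self.fired = False

    def reset(self, now):
        """Called periodically by a properly executing program."""
        self.last_reset = now

    def tick(self, now):
        """Advance simulated time; fire the NMI once if the grace period lapsed."""
        if not self.fired and now - self.last_reset >= self.grace_period:
            self.fired = True
            self.nmi_handler()


def make_panic_log():
    """Return a log and a handler that records the controlled-shutdown steps."""
    log = []

    def panic():
        log.append("save diagnostics")
        log.append("close files")
        log.append("restart")

    return log, panic


def run(healthy_until, total_ticks, grace_period=3):
    """The program resets the timer on every tick up to `healthy_until`, then hangs."""
    log, panic = make_panic_log()
    timer = WatchdogTimer(grace_period, panic)
    for now in range(1, total_ticks + 1):
        if now <= healthy_until:
            timer.reset(now)
        timer.tick(now)
    return log
```

In this model, `run(10, 10)` leaves the log empty because the timer is reset on every tick, while `run(4, 10)` records the three shutdown steps once the simulated program stops resetting the timer and the grace period expires.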
In a multi-threaded, multi-processor system, it is often possible for execution of one code thread to get lost or stuck in a loop while it appears that the other threads are executing normally. However, the improper execution of one of the threads may be corrupting the execution of the other threads. Yet it is rather burdensome to allocate a separate watchdog for each of the code threads and to have each of the code threads reset its own watchdog.
A system watchdog for a multi-threaded, multi-processor system is shown in FIG. 9 of Jean-Pierre Bono, U.S. Patent Publication 2003/0018691 published Jan. 23, 2003 and entitled “Queues for Soft Affinity Code Threads and Hard Affinity Code Threads for Allocation of Processors to Execute the Threads in a Multi-Processor System,” incorporated herein by reference. Each processor has a hard affinity queue that is serviced only by that processor. Once each second, a watchdog thread is issued to the hard affinity queue of each and every processor. When each processor services its hard affinity queue and finds the watchdog thread, it turns on a bit for that processor within a status variable. In response to an NMI every ten seconds, the status variable is checked, and if any bit is off, the system is shut down and restarted. Thus, this system watchdog detects a failure of any processor to service the watchdog thread in its hard affinity queue within a grace period of about 9 to 10 seconds. A variation of this system watchdog has been used in a system having four processors. In this variation, during each one-second interval, a single watchdog thread is woken up and then passed in round-robin fashion from the hard affinity queue of one processor to the hard affinity queue of a next one of the processors. Once it has been passed through the hard affinity queues of all of the processors, the watchdog thread is suspended.
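The per-processor status bits and the periodic check can be sketched as follows. This is a simplified model, not the implementation of the cited publication. The names `SystemWatchdog`, `per_processor_pass`, and `round_robin_pass` are hypothetical. Two behaviors are assumptions, since the description does not state them: the status bits are cleared after each NMI check, and in the round-robin variation the watchdog thread sets status bits and stalls at a processor that does not service its queue.

```python
class SystemWatchdog:
    """Status variable with one bit per processor (hypothetical model)."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.all_on = (1 << num_processors) - 1  # every processor's bit set
        self.status = 0

    def service_watchdog_thread(self, cpu):
        """Processor `cpu` found the watchdog thread in its hard affinity queue."""
        self.status |= 1 << cpu

    def on_nmi(self):
        """Periodic NMI check: returns True if the system must shut down and restart."""
        must_restart = self.status != self.all_on
        self.status = 0  # assumption: bits are cleared for the next interval
        return must_restart


def per_processor_pass(wd, responsive):
    """Base scheme: a watchdog thread is issued to every processor's queue."""
    for cpu in range(wd.num_processors):
        if cpu in responsive:  # only responsive processors service their queue
            wd.service_watchdog_thread(cpu)


def round_robin_pass(wd, responsive):
    """Variation: a single thread is passed from queue to queue in order."""
    for cpu in range(wd.num_processors):
        if cpu not in responsive:
            return  # assumption: an unserviced thread is never passed on
        wd.service_watchdog_thread(cpu)
```

For a four-processor model, calling `per_processor_pass(wd, {0, 1, 2, 3})` followed by `wd.on_nmi()` reports no restart, whereas omitting any processor from the responsive set leaves its bit off and the check reports that a shutdown and restart is required.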