A typical computing system may include one or more conventional processors and one or more conventional watchdog timers. The watchdog timers provide a “sanity check” for the system, restoring the system to a known condition should one or more of the processors fail. In a single-processor system, a presumably “sane” processor periodically resets the watchdog timer before the timer times out. Should the timer time out because of a fault in the processor, however, the processor is typically reset and then executes recovery software to reestablish normal operation.
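The single-processor behavior described above can be illustrated with a minimal simulation sketch. It is not drawn from any particular hardware; the names `WatchdogTimer`, `kick`, `tick`, and `run_single_processor`, and the tick-based timing model, are assumptions made purely for illustration.

```python
class WatchdogTimer:
    """Hypothetical countdown timer that must be reset ("kicked") before it expires."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self):
        # A sane processor calls this periodically to restart the countdown.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Advance one tick; return True if the timer has timed out.
        self.remaining -= 1
        return self.remaining <= 0


def run_single_processor(total_ticks, fault_at, timeout_ticks=3):
    """Simulate one processor that stops kicking its watchdog at tick `fault_at`.

    Returns the list of ticks at which the watchdog reset the processor.
    """
    wdt = WatchdogTimer(timeout_ticks)
    sane = True
    resets = []
    for t in range(total_ticks):
        if t == fault_at:
            sane = False      # assumed fault: the processor stops kicking the timer
        if sane:
            wdt.kick()
        if wdt.tick():
            resets.append(t)  # time-out: reset the processor and run recovery
            sane = True       # recovery is assumed to restore normal operation
            wdt.kick()
    return resets
```

In this sketch, a processor that never faults produces no resets, while a fault leads to exactly one reset, after which the processor resumes kicking the timer.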
When a multiprocessor system has one or more watchdog timers associated with each processor, system instability might occur should one or more of the watchdog timers time out. The instability occurs because, once the failed processor is reset, the remaining processors may operate incorrectly (e.g., they become “hung”) while waiting for a response from the failed processor. The watchdog timers corresponding to the hung processors then time out in turn, causing still other processors to hang, and so on.
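The cascade described above can be illustrated with a small simulation sketch. The model is a deliberate simplification and an assumption throughout: `simulate_cascade` and the `waits_on` dependency table are hypothetical, each processor has exactly one watchdog timer, and a processor is assumed to hang as soon as a processor it waits on is reset.

```python
def simulate_cascade(waits_on, first_failure, timeout_ticks=2, total_ticks=8):
    """Simulate uncoordinated per-processor watchdog timers (hypothetical model).

    waits_on[q] lists the processors whose responses processor q blocks on.
    Returns a list of (tick, processor) reset events.
    """
    n = len(waits_on)
    hung = [False] * n
    hung[first_failure] = True   # the initial fault
    since_kick = [0] * n         # ticks since each watchdog was last kicked
    events = []
    for t in range(total_ticks):
        timed_out = []
        for p in range(n):
            if hung[p]:
                since_kick[p] += 1          # a hung processor cannot kick its timer
                if since_kick[p] >= timeout_ticks:
                    timed_out.append(p)
            else:
                since_kick[p] = 0           # a sane processor kicks its timer
        for p in timed_out:
            events.append((t, p))           # the watchdog resets processor p
            hung[p] = False
            since_kick[p] = 0
            for q in range(n):
                # Assumption: anything waiting on p hangs when p is reset.
                if q != p and p in waits_on[q]:
                    hung[q] = True
    return events


# Three processors in a ring: 0 waits on 2, 1 waits on 0, 2 waits on 1.
ring = [[2], [0], [1]]
events = simulate_cascade(ring, first_failure=0)
```

With the ring dependency shown, a single fault in processor 0 produces resets that travel around the ring and eventually return to processor 0, so the system never settles. With no dependencies between processors, the same fault produces one reset and nothing more.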
Therefore, it is desirable to provide a multiprocessor system in which the watchdog timers respond to a failed processor in a controlled, systematic way.