The present invention relates to software fault detection. Many network devices rely on continuously running software for proper operation. When a software fault occurs, some corrective action must be taken, e.g., a reboot or a fail-over to another instance of the software. Before the corrective action can be taken, the fault must be detected.
Since failed software cannot be counted on to positively indicate its own fault, some fault detection schemes employ a “watchdog” timer. This can be a hardware timer or a virtual timer provided by an operating system based on a hardware timer. A watchdog timer counts (up or down) to some threshold at which it will trigger an interrupt or other fault response. The software being “watched” is designed to repeatedly reset the watchdog timer so that it does not time out as long as the software is functioning properly.
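By way of illustration only, the watchdog behavior described above can be sketched in software as follows. The class name `WatchdogTimer`, its methods, and the injectable `clock` parameter are hypothetical names chosen for this sketch; they are not taken from any particular operating system or hardware interface, and a real watchdog would typically trigger an interrupt rather than be polled.

```python
import time


class WatchdogTimer:
    """Illustrative software watchdog: reports a fault once the time-out
    period elapses without a reset. All names here are hypothetical."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable clock (assumption, eases testing)
        self._last_reset = clock()     # counting starts at creation

    def reset(self):
        """Called repeatedly by the watched software while it is healthy."""
        self._last_reset = self._clock()

    def expired(self):
        """True once the time-out period has elapsed since the last reset."""
        return self._clock() - self._last_reset >= self.timeout_s
```

In this sketch, the watched software would call `reset()` at regular intervals shorter than `timeout_s`, and a separate supervisor would poll `expired()` and initiate the corrective action (e.g., a reboot or fail-over) when it returns true.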
Setting the time-out period can involve a tradeoff between: 1) allowing long-duration actions to complete without triggering a false fault detection; and 2) providing rapid fault detection so that interruptions in the software's functionality are limited in duration. What is needed is an approach to fault detection that avoids the compromises typically imposed by this tradeoff.
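The tradeoff can be made concrete with a simple calculation. The following sketch assumes, for illustration only, that the watched software cannot reset the timer while a long-duration action is in progress; the function name `evaluate_timeout` and the example durations are hypothetical.

```python
def evaluate_timeout(timeout_s, longest_action_s):
    """Return (false_fault, worst_case_detection_s) for one time-out choice.

    Assumption: the software cannot reset the timer during a long action.
    """
    # A legitimate action lasting at least the time-out period cannot reset
    # the timer in time, so it is misreported as a fault.
    false_fault = longest_action_s >= timeout_s
    # A real fault occurring just after a reset goes undetected for up to
    # the full time-out period.
    worst_case_detection_s = timeout_s
    return false_fault, worst_case_detection_s


# Hypothetical figures: the longest legitimate action takes 30 seconds.
short_choice = evaluate_timeout(timeout_s=5, longest_action_s=30)
long_choice = evaluate_timeout(timeout_s=60, longest_action_s=30)
```

With these example figures, a 5-second time-out detects real faults quickly but falsely flags the 30-second action, whereas a 60-second time-out avoids the false detection at the cost of allowing a real fault to go unnoticed for up to 60 seconds.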
Herein, related art is described to facilitate understanding of the invention. Related art labeled “prior art” is admitted prior art; related art not labeled “prior art” is not admitted prior art.