1. Technical Field
The present invention relates generally to failure detection and fencing in a computing system, and more particularly to detecting and fencing off a failed entity instance so that failover time in the computing system is reduced.
2. Discussion of Related Art
Many computing systems maintain important state information on persistent media. For example, a database system typically stores the actual data of the database on persistent media. When such a system experiences a critical failure, it is desirable to restart the system as quickly as possible, either on the same host computer or on a different computer that has access to the same persistent media. The overall goal is to minimize the period of time during which the system is unavailable for use by end users (known as “failover time”). Mechanisms used to automate this failure recovery include polling techniques, which determine whether the system is healthy, and event-based techniques, which generate an event if the system experiences a critical failure.
An example of a polling-based technique is one that regularly sends status requests (e.g., “are you alive” messages) to the system in question and waits for a certain amount of time (a timeout period) to receive a positive response. If a positive response is not received within the timeout period, the system is declared failed, and additional actions are taken, such as restarting the system (e.g., starting a new instantiation of the system). An example of an event-based mechanism is registering a handler for the SIGCHLD signal that is generated when a child process fails, and, in the handler, initiating the restart of the system. These techniques have significant drawbacks. Polling requires extra CPU cycles (e.g., the CPU cycles associated with sending and responding to the messages) and experiences a significant delay in detecting the death of the system, because a failure may go unnoticed until the next status request is sent and its timeout period expires. Event-based techniques may experience significant delays before the event is generated. For example, a SIGCHLD is typically not generated until the collection of all diagnostic information necessary for root cause analysis of the error (e.g., core dumps and other dumps) is complete, which can result in a delay of several seconds or more.
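The polling-based technique described above can be illustrated with a minimal sketch. It is not taken from any particular system. The names `probe`, `restart`, `is_alive`, and `monitor`, as well as the default timeout and polling interval, are assumptions introduced only for illustration. The `probe` callable is assumed to send one status request and return True if a positive response arrives within the timeout period.

```python
import time
from typing import Callable, Optional


def is_alive(probe: Callable[[float], bool], timeout_s: float) -> bool:
    """Send one status request and report whether a positive response arrived.

    `probe` is a hypothetical callable: it sends an "are you alive" message,
    waits at most `timeout_s` seconds, and returns True on a positive reply.
    A timeout or connection error counts as no positive response.
    """
    try:
        return bool(probe(timeout_s))
    except OSError:  # TimeoutError and ConnectionError are OSError subclasses
        return False


def monitor(
    probe: Callable[[float], bool],
    restart: Callable[[], None],
    timeout_s: float = 1.0,       # assumed timeout period
    interval_s: float = 5.0,      # assumed delay between status requests
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the system is declared failed, then call `restart`.

    Returns the number of status requests that were sent. `max_polls`
    bounds the loop so the sketch can also be run for a fixed number
    of polls; None means poll until a failure is detected.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if not is_alive(probe, timeout_s):
            # No positive response within the timeout: declare failure.
            restart()
            return polls
        # Healthy: wait before sending the next status request.
        sleep(interval_s)
    return polls
```

In this sketch, a system that dies immediately after answering a status request is not declared failed until roughly `interval_s + timeout_s` seconds later, and every poll costs work on both the sender and the responder. Shortening the interval reduces the detection delay only by increasing the polling overhead.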