A typical distributed computer system includes several interconnected nodes. The nodes may communicate by the use of messages, shared memory, etc. Through communication, nodes in a distributed computer system are able to provide greater functionality. For example, the distributed computer system may be used for communication between users, solving computationally hard problems, dividing tasks between nodes to provide greater throughput (e.g., a web server accessing a database server), etc.
Occasionally, one or more nodes in the distributed computer system may fail. The failure of a node may be attributed to the hardware failing or the software failing. For example, hardware failure may occur when the processor crashes, when the node fails to transmit data (e.g., messages, data to and from storage, etc.), etc. Likewise, software failure may occur when a module of an application or the operating system fails to execute properly (or as expected). For example, an application or operating system thread may execute an infinite loop, several threads waiting for resources may cause deadlock, a thread may crash, etc.
Managing a distributed computer system when failure occurs involves detecting that a failure has occurred and recovering from the failure before the entire distributed computer system crashes. Detecting a failure involves determining whether the nodes in the distributed computer system are still operating.
One method for detecting a failure in hardware includes having each node in the distributed computer system send heartbeat messages to the other nodes at regular time intervals. Using the heartbeat method, at each time interval, the currently executing process is swapped out and the node sends the heartbeat.
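The receiving side of the heartbeat method can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the rule that a node is suspected once no heartbeat has arrived within the timeout are all assumptions made for the example.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Hypothetical helper that tracks the last heartbeat seen from each peer node."""

    def __init__(self, timeout_s: float) -> None:
        # Assumed policy: a node is suspected after this many silent seconds.
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record(self, node_id: str, now: Optional[float] = None) -> None:
        """Note that a heartbeat from node_id arrived at time `now`."""
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def suspected_failed(self, now: Optional[float] = None) -> List[str]:
        """Return the nodes whose most recent heartbeat is older than the timeout."""
        current = time.monotonic() if now is None else now
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if current - seen > self.timeout_s
        )


# Usage with explicit timestamps so the outcome does not depend on the clock.
monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.0)  # node-a keeps sending, node-b goes silent
silent_nodes = monitor.suspected_failed(now=4.0)
```

In this sketch a missed heartbeat only marks a node as suspected; the timeout would typically be set to several heartbeat intervals so that one delayed message does not trigger recovery.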
Another method for detecting a failure in hardware is for a master node to send a child node a stop thread call that includes a command to respond. When the child node receives the stop thread call, the child node immediately stops whatever thread is executing, regardless of whether the thread is in a critical section, and responds to the master node indicating that the child node has stopped. Accordingly, when the master node receives the response, the master node is assured that the hardware on the child node is executing well enough to send the response.
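The stop-and-respond exchange can be approximated in a single process as shown below. This is only a sketch under stated assumptions: the names `child_node` and `probe_child` are hypothetical, the master and child are modeled as threads rather than separate machines, and the child polls a flag because Python cannot forcibly halt a running thread, so the immediate stop described above is not reproduced.

```python
import queue
import threading


def child_node(stop_requested: threading.Event,
               replies: "queue.Queue[str]",
               healthy: bool) -> None:
    """Hypothetical child: works until told to stop, then acknowledges."""
    if not healthy:
        return  # simulates failed hardware: the child never answers
    while not stop_requested.is_set():
        stop_requested.wait(0.01)  # stand-in for the thread's normal work
    replies.put("stopped")


def probe_child(healthy: bool = True, timeout_s: float = 0.5) -> bool:
    """Hypothetical master: send the stop call and wait for the response."""
    stop_requested = threading.Event()
    replies: "queue.Queue[str]" = queue.Queue()
    worker = threading.Thread(
        target=child_node, args=(stop_requested, replies, healthy), daemon=True
    )
    worker.start()
    stop_requested.set()  # plays the role of the stop thread call
    try:
        return replies.get(timeout=timeout_s) == "stopped"
    except queue.Empty:
        return False  # no response within the timeout: suspect a failure


child_responded = probe_child(healthy=True)
child_silent = probe_child(healthy=False)
```

A response shows only that the child could receive the call and reply. It says nothing about whether the software that was interrupted was behaving correctly.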
Detecting a failure in the software executing on the node requires a method for determining whether a thread is executing an infinite loop, a deadlock has resulted from several threads waiting for resources, a thread has crashed, etc. Accordingly, detecting failure of the software can be more challenging than detecting failure in hardware.
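As an illustration of one of these cases, deadlock among threads waiting for resources is commonly modeled as a cycle in a wait-for graph. The text above does not name a specific technique, so the sketch below is an assumption: the function `has_deadlock` and the dictionary format (each thread mapped to the threads it is waiting on) are hypothetical.

```python
from typing import Dict, List

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def has_deadlock(waits_for: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle."""
    color: Dict[str, int] = {}

    def visit(thread: str) -> bool:
        color[thread] = GRAY
        for blocker in waits_for.get(thread, []):
            state = color.get(blocker, WHITE)
            if state == GRAY:
                return True  # reached a thread already on the path: cycle
            if state == WHITE and visit(blocker):
                return True
        color[thread] = BLACK
        return False

    return any(
        color.get(thread, WHITE) == WHITE and visit(thread)
        for thread in waits_for
    )


# t1 waits on t2 while t2 waits on t1, so neither can proceed.
circular_wait = has_deadlock({"t1": ["t2"], "t2": ["t1"]})
# t1 waits on t2, but t2 waits on nothing and can finish.
simple_wait = has_deadlock({"t1": ["t2"], "t2": []})
```

Building such a graph requires knowing which thread holds each resource, which is part of why software failures are harder to observe than a missing hardware response. Infinite loops and crashed threads would need different checks and are not covered by this sketch.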