This invention relates to distributed computer systems. A common concern for computer system manufacturers, computer component manufacturers, and computer users is detecting and recovering from computer system malfunctions. Such malfunctions may arise from a range of causes, such as memory data corruption, data corruption related to fixed disks or removable media, operating system errors, component errors, overheating components, applications or operating systems executing instructions that are illegal for the processor, incompatibility between various hardware and software system components, and so on.
One class of malfunction is referred to as a system “hang.” A system is “hung” when it is no longer able to make progress. Malfunctions that can result in system hangs include an operating system or hardware component entering an unknown state and being unable to leave that state, so that the operating system or hardware component ceases normal operation. In these cases, the user must restart the computer. Restarting the computer after a system hang can cause problems such as data loss and corruption.
Conventionally, system hangs are detected using timers known as “watchdog” timers. In a typical watchdog timer implementation, a processor periodically resets the timer, so that under normal operation the timer never reaches a predetermined expiry value (or, for a timer that counts down from a predetermined value, never reaches zero). If the timer reaches the expiry value, this indicates that a system hang has occurred, and the computer system is reset.
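The watchdog behavior described above can be illustrated with a minimal software sketch. It is not taken from any particular implementation: the names `WatchdogTimer`, `kick`, `tick`, and `simulate`, the reload value of 5, and the use of a callback in place of an actual system reset are all assumptions made for illustration. The sketch models the count-down variant, in which the processor periodically reloads the counter and a reset is requested only if the counter reaches zero.

```python
class WatchdogTimer:
    """Count-down watchdog model; all names here are illustrative."""

    def __init__(self, reload_value, on_expire):
        if reload_value <= 0:
            raise ValueError("reload_value must be positive")
        self.reload_value = reload_value
        self.on_expire = on_expire  # stands in for the system reset
        self.count = reload_value
        self.expired = False

    def kick(self):
        """Called by the processor to show it is still making progress."""
        if not self.expired:
            self.count = self.reload_value

    def tick(self):
        """Advance the timer by one clock period."""
        if self.expired:
            return
        self.count -= 1
        if self.count == 0:
            self.expired = True
            self.on_expire()


def simulate(total_ticks, kick_every, hang_at=None, reload_value=5):
    """Return the tick at which a reset was requested, or None if never.

    The simulated processor kicks the watchdog every `kick_every` ticks
    until it hangs at tick `hang_at` (assumed parameter; None = no hang).
    """
    resets = []
    wd = WatchdogTimer(reload_value, on_expire=lambda: resets.append("reset"))
    for t in range(1, total_ticks + 1):
        hung = hang_at is not None and t >= hang_at
        if not hung and t % kick_every == 0:
            wd.kick()
        wd.tick()
        if resets:
            return t
    return None


if __name__ == "__main__":
    print(simulate(20, kick_every=3))             # healthy processor
    print(simulate(20, kick_every=3, hang_at=7))  # processor hangs at tick 7
```

In this sketch a healthy processor that kicks every 3 ticks never lets the counter run out, whereas a processor that hangs at tick 7 stops kicking and the reset is requested a few ticks later, once the last reloaded count has drained.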
In a distributed system, however, application instances running on different machines need to be able to coordinate with each other, for example by responding to requests from other machines. If an application on one computer hangs, the application will fail to respond to requests from other computers. As a result, applications running on the other computers may wait forever for the response. Thus, a system hang in one machine may trigger a chain reaction that causes the whole distributed system to hang.
A client-server model is common in a distributed system. In the client-server model, a server waits for a request from a client. After a client has sent a request to the server, the client waits for the server's response to the request. That is, both the client and the server must block to wait for external events. Introducing a watchdog timer to each client and server in a distributed system would make the system very complex, because the clients and servers would have to wake up periodically from their waiting states to reset the watchdog timers, in order to prevent the watchdog timers from resetting the respective computers.
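The added complexity noted above can be seen in a small sketch of a server's wait loop. This is an illustrative assumption rather than a described implementation: the function `wait_for_request`, its parameters, and the use of Python's standard `queue.Queue` as the request channel are all hypothetical. Instead of blocking once until a request arrives, the waiting side must block with a timeout shorter than the watchdog period and wake up repeatedly only to reset the timer.

```python
import queue


def wait_for_request(requests, kick_watchdog, kick_interval_s, max_wakeups=None):
    """Wait for a request while keeping a watchdog from expiring.

    requests        -- queue.Queue standing in for the network channel
    kick_watchdog   -- callable that resets the local watchdog timer
    kick_interval_s -- wake-up period; must be shorter than the watchdog timeout
    max_wakeups     -- hypothetical bound so the sketch can terminate in a demo

    Returns (request, idle_wakeups); request is None if the bound was reached.
    """
    idle_wakeups = 0
    while True:
        # The waiter has made no progress, yet it must still reset the timer.
        kick_watchdog()
        try:
            # Block for at most one kick interval instead of indefinitely.
            return requests.get(timeout=kick_interval_s), idle_wakeups
        except queue.Empty:
            idle_wakeups += 1
            if max_wakeups is not None and idle_wakeups >= max_wakeups:
                return None, idle_wakeups


if __name__ == "__main__":
    channel = queue.Queue()
    kicks = []
    result = wait_for_request(channel, lambda: kicks.append(1), 0.01, max_wakeups=3)
    print(result, len(kicks))
```

Every blocking point in every client and server would need a loop of this kind, and each wake-up resets the timer even though no useful work was done. The reset therefore stops being evidence that the waiting application is actually making progress.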