The present invention is generally related to utilizing an adaptive threshold to detect a failed process in a distributed computer system.
In recent years, reliable, high-performance computer systems have been, and remain, in great demand. Users have also demanded the introduction and propagation of multi-processor distributed computer systems to support their computing processes (e.g., simulations and parallel processing). A distributed computer system generally includes a collection of processes and a collection of execution platforms (i.e., hosts). Each process may be capable of executing on a different host, and collectively, the processes function to provide a computer service. A failure of a critical process in a distributed system may result in the service halting. Therefore, techniques have been implemented for detecting a failure of a process in a timely manner, such that an appropriate action can be taken.
A conventional technique for detecting failure of a process includes the use of heartbeats, which are messages exchanged between processes at regular intervals of time, for example, on a network link or between a set of interfaces relegated to an exchange of internal control messages. Two methods are commonly used. The first is a request-response scheme. A sending process may send out a heartbeat (hereinafter referred to as a “ping”) to which it expects a response (hereinafter referred to as a “pong”) from a remote process (e.g., another process in the same group) in a distributed system. The sending process measures the time interval between issuing the “ping” and receiving back the “pong”. This time interval is the heartbeat arrival time. According to this technique, if the sending process, expecting a “pong”, does not receive it prior to the expiration of a predetermined length of time, i.e., prior to the expiration of a heartbeat timeout, the remote process is suspected to have failed. Corrective action, such as eliminating the suspected process, may thus be taken.
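The request-response check described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the function name is_suspected and its parameters are hypothetical, timestamps are assumed to be in seconds from a common clock, and an arrival time exactly equal to the timeout is treated here as a timeout.

```python
from typing import Optional


def is_suspected(ping_sent_at: float,
                 pong_received_at: Optional[float],
                 heartbeat_timeout: float) -> bool:
    """Return True if the remote process should be suspected of failure.

    pong_received_at is None when no "pong" was received at all.
    All names here are illustrative assumptions, not taken from the source.
    """
    if pong_received_at is None:
        # No response arrived before the sender stopped waiting.
        return True
    # Heartbeat arrival time: interval between issuing the "ping"
    # and receiving back the "pong".
    arrival_time = pong_received_at - ping_sent_at
    # Suspect the remote process unless the "pong" arrived
    # before the heartbeat timeout expired.
    return arrival_time >= heartbeat_timeout


if __name__ == "__main__":
    print(is_suspected(0.0, 0.4, heartbeat_timeout=1.0))
    print(is_suspected(0.0, None, heartbeat_timeout=1.0))
```

In a real system the sender would wait on a socket or a message queue rather than compare two recorded timestamps; the comparison itself is the part this sketch is meant to show.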
A second method is a heartbeat stream scheme. For example, a sending process A sends a sequence of heartbeat messages to a receiving process B. Process B measures the time interval between receiving successive heartbeat messages from process A. This time interval is the heartbeat arrival time for the second method. If process B does not receive a heartbeat prior to the expiration of a predetermined length of time, i.e., prior to the expiration of a heartbeat timeout, the remote process (e.g., process A) is suspected of failing. Once more, corrective action, such as eliminating the suspected process, may thus be taken.
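The heartbeat stream scheme can be sketched in the same way from the point of view of receiving process B. The helper names inter_arrival_times and sender_suspected are hypothetical, and the sketch assumes that B records a timestamp (in seconds) each time a heartbeat from process A is received.

```python
from typing import List


def inter_arrival_times(receipt_times: List[float]) -> List[float]:
    """Heartbeat arrival times: intervals between successive receipts."""
    return [later - earlier
            for earlier, later in zip(receipt_times, receipt_times[1:])]


def sender_suspected(last_receipt: float,
                     now: float,
                     heartbeat_timeout: float) -> bool:
    """Return True if no heartbeat has arrived within the timeout.

    Illustrative assumption: an elapsed time equal to the timeout
    counts as an expiration.
    """
    return (now - last_receipt) >= heartbeat_timeout


if __name__ == "__main__":
    receipts = [0.0, 1.0, 2.5, 3.5]  # times at which B received heartbeats
    print(inter_arrival_times(receipts))
    print(sender_suspected(receipts[-1], now=6.0, heartbeat_timeout=2.0))
```

Process B would typically call sender_suspected periodically, or arm a timer that is reset on each receipt, so that a missing heartbeat is noticed even though no further message arrives to trigger the check.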
The common feature in these two schemes is that a heartbeat arrival time is measured. If this time exceeds a specified heartbeat timeout, corrective action is taken. The present invention applies to both these schemes.
The length of time for a heartbeat to travel between processes may vary based on a variety of conditions, such as system or network load, local area network (LAN) pauses, and other transient events. The heartbeat timeout may be soft-tunable, i.e., a system administrator can set the heartbeat timeout to an appropriate length of time for a particular network or application. However, a system administrator may need to continually monitor a network on which the heartbeats are transmitted and other factors affecting the transmission of heartbeats for determining the appropriate length of time for the heartbeat timeout. Furthermore, the conditions affecting the transmission of heartbeats may change frequently. Therefore, the system administrator may need to change the length of the heartbeat timeout frequently to account for transient conditions that may affect the transmission of heartbeats. Furthermore, conventional techniques for monitoring processes in a system, such as the two schemes described above, generally do not account for transient conditions that may affect transmission of heartbeats.
In one respect, the present invention includes a method including the steps of (1) receiving a heartbeat from a process in the distributed system; (2) determining whether a heartbeat arrival time of the received heartbeat is less than a first heartbeat timeout; and (3) adjusting the first heartbeat timeout in response to the heartbeat arrival time being less than the first heartbeat timeout.
The method further comprises the steps of recalculating the first heartbeat timeout, such that the recalculated heartbeat timeout is less than or greater than the first heartbeat timeout, and then adjusting the first heartbeat timeout to be equal to the recalculated heartbeat timeout. The recalculated heartbeat timeout is based on the heartbeat arrival times of one or more heartbeats.
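One way the receive, compare, and adjust steps could fit together is sketched below. The text above does not specify how the recalculated timeout is derived from the heartbeat arrival times, so the formula used here (mean of recent arrival times, plus a multiple of their standard deviation, plus a fixed margin) is purely an assumption chosen for illustration, as are the class name AdaptiveHeartbeatTimeout and the parameters window, safety_factor, and margin.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveHeartbeatTimeout:
    """Illustrative adaptive timeout; the recalculation rule is an assumption."""

    def __init__(self, initial_timeout: float, window: int = 10,
                 safety_factor: float = 2.0, margin: float = 0.5) -> None:
        self.timeout = initial_timeout          # the "first heartbeat timeout"
        self._arrivals = deque(maxlen=window)   # recent heartbeat arrival times
        self._safety_factor = safety_factor
        self._margin = margin

    def observe(self, arrival_time: float) -> bool:
        """Process one heartbeat arrival time.

        Returns True if the heartbeat arrived before the current timeout
        (in which case the timeout is recalculated), or False if the
        process should be suspected (the timeout is left unchanged).
        """
        if arrival_time >= self.timeout:
            return False
        self._arrivals.append(arrival_time)
        # Assumed rule: the result may be lower or higher than the
        # previous timeout, depending on the observed arrival times.
        self.timeout = (mean(self._arrivals)
                        + self._safety_factor * pstdev(self._arrivals)
                        + self._margin)
        return True


if __name__ == "__main__":
    monitor = AdaptiveHeartbeatTimeout(initial_timeout=5.0)
    for arrival in (1.0, 1.4, 1.1, 4.0):
        on_time = monitor.observe(arrival)
        print(arrival, on_time, round(monitor.timeout, 3))
```

With this rule the timeout tightens when heartbeats arrive quickly and consistently, and loosens when arrival times grow or become more variable. The fixed margin keeps the timeout above the observed mean even when every recent arrival time is identical.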
The method of the present invention includes steps that may be performed by computer-executable instructions recorded on a computer-readable medium.
In still another respect, the present invention includes a distributed system having a plurality of processes in communication with each other. The distributed system includes a first host executing a first process of the plurality of processes; a second host executing a second process of the plurality of processes; and at least one communication path connecting the first and second hosts. The second process is operable to transmit a series of heartbeats on the at least one communication path to the first process. The first process is capable of monitoring the second process based on the received series of heartbeats and based on an adjustable heartbeat timeout associated with a period of time for receiving a heartbeat from the second process before suspecting a failure of the second process. The first process is further capable of adjusting the heartbeat timeout in response to receiving a heartbeat in the series of heartbeats prior to the expiration of the heartbeat timeout.
It is desirable not to have a heartbeat timeout that is too long, resulting in an extended period of time in which a process failure goes undetected. On the other hand, when the heartbeat timeout is too strict, a process may be improperly suspected of failing and corrective actions may unnecessarily be taken in response to improperly suspecting a failed process. The present invention provides an adjustable heartbeat timeout that may be adjusted based on observed conditions and which minimizes improperly suspecting a failed process.
Those skilled in the art will appreciate these and other advantages and benefits of various embodiments of the invention upon reading the following detailed description of a preferred embodiment with reference to the below-listed drawings.