The present invention is generally related to monitoring computer processes in a distributed system. More particularly, the present invention is related to detecting process and network failures in a distributed system.
In recent years, reliable, high-performance computer systems have been, and still are, in great demand. Users have also demanded the introduction and propagation of multi-processor distributed computer systems to support their computing processes (e.g., simulations, parallel processing, etc.). A distributed computer system generally includes a collection of processes and a collection of execution platforms (i.e., hosts). Each process may be capable of executing on a different host, and collectively, the processes function to provide a computer service. A failure of a critical process in a distributed system may result in the service halting. Therefore, techniques have been implemented for detecting a failure of a process in a timely manner, such that an appropriate action can be taken.
A conventional technique for detecting failure of a process includes the use of heartbeats, which are messages sent between processes at regular intervals of time. According to the heartbeat technique, if a process does not receive a heartbeat from a remote process prior to the expiration of a predetermined length of time, i.e., the heartbeat timeout, the remote process is suspected to have failed. Corrective action, such as eliminating the suspected process, may thus be taken.
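The conventional heartbeat technique described above can be sketched as follows. This is only an illustrative sketch: the class name `HeartbeatMonitor`, its method names, and the use of caller-supplied timestamps in seconds are assumptions made for the example and are not taken from any particular system.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each remote process (hypothetical helper)."""

    def __init__(self, heartbeat_timeout):
        # Predetermined length of time after which a silent process is suspected.
        self.heartbeat_timeout = heartbeat_timeout
        # Maps process id -> timestamp of the most recent heartbeat.
        self.last_seen = {}

    def record_heartbeat(self, process_id, now):
        """Note that a heartbeat from process_id arrived at time `now`."""
        self.last_seen[process_id] = now

    def suspected(self, now):
        """Return the ids of processes whose heartbeat is overdue at time `now`."""
        return sorted(
            pid
            for pid, seen in self.last_seen.items()
            if now - seen > self.heartbeat_timeout
        )


monitor = HeartbeatMonitor(heartbeat_timeout=5.0)
monitor.record_heartbeat("a", now=0.0)
monitor.record_heartbeat("b", now=4.0)
overdue = monitor.suspected(now=6.0)  # only "a" has been silent for more than 5.0
```

In a running system the timestamps would typically come from a monotonic clock such as Python's `time.monotonic()`; they are passed in explicitly here so that the behavior is easy to follow.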
A failure to receive a heartbeat from a remote process, however, may not be an indication of a failure in the remote process. Instead, a network failure may have prevented a process from receiving a heartbeat from the remote process, especially when multiple processes in a distributed system are communicating over a common network. For example, a network failure may include a network pause (i.e., a temporary condition that prevents communication on a network) or a less temporary network failure, such as a failure of hardware facilitating transmission on the network. A network pause, for example, can be the result of heavy, high-priority traffic over a network link, sometimes caused by other processes (e.g., remote machine backups). If the network pause endures for a period of time greater than the heartbeat timeout or if a network failure occurs, each process waiting for a heartbeat transmitted over the network in the distributed system may suspect the other processes of failing. Then, each process may take unnecessary corrective actions, such as eliminating the suspected processes from the distributed system and/or replacing them, which can cause each service provided by the processes in the distributed system to be halted. If such network conditions could be detected, appropriate corrective action could be taken, such as establishing connections between the distributed system processes using alternative paths.
An aspect of the present invention is to provide a system and method for detecting and distinguishing between a process failure and a network failure in a distributed system.
In one respect, the present invention includes a system and method for detecting a process failure in a distributed system. A process in the distributed system is connected to a plurality of other processes in the distributed system via a network. If the difference between the period of time to receive a heartbeat from a first process of the plurality of processes and the period of time to receive a heartbeat from a second process of the plurality of processes exceeds a process failure threshold, the second process is suspected of failing.
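The comparison described in this respect might be sketched as follows. The function name `suspect_processes`, the dictionary of per-process waiting times, and the choice of the most promptly heard process as the "first" process are all assumptions made for illustration; the text above does not prescribe how the first process is selected.

```python
def suspect_processes(time_to_receive, process_failure_threshold):
    """Return ids of processes suspected of failing (illustrative sketch).

    time_to_receive maps a process id to the period of time spent waiting
    for that process's heartbeat. A process is suspected when its period
    exceeds that of the reference process by more than the threshold.
    """
    if not time_to_receive:
        return []
    # Assumption: the reference ("first") process is the one heard most promptly.
    reference = min(time_to_receive.values())
    return sorted(
        pid
        for pid, waited in time_to_receive.items()
        if waited - reference > process_failure_threshold
    )


# p3 lags the others by far more than the threshold, so only p3 is suspected.
lagging = suspect_processes({"p1": 1.0, "p2": 1.2, "p3": 4.5}, 2.0)

# All heartbeats are delayed by a similar amount (e.g., a network pause),
# so no individual process is suspected.
uniform_delay = suspect_processes({"p1": 10.0, "p2": 10.5}, 2.0)
```

Comparing periods against each other, rather than against a fixed timeout, means that a delay affecting every process equally does not by itself mark any process as failed.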
In another respect, the present invention includes a system and method for detecting a network failure in the distributed system. A process in the distributed system monitors a plurality of other processes in the distributed system via a network. If the process fails to receive a heartbeat from any one of the plurality of processes within a network failure time limit, the network in the distributed system is suspected of failing.
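A minimal sketch of this check is shown below. It assumes the reading that the network is suspected only when no heartbeat has arrived from any of the monitored processes within the network failure time limit; the function name `network_suspected` and the timestamp dictionary are hypothetical and introduced only for the example.

```python
def network_suspected(last_heartbeat, now, network_failure_time_limit):
    """Return True if the network is suspected of failing (illustrative sketch).

    last_heartbeat maps a process id to the time its latest heartbeat arrived.
    Assumption: the network is suspected only when every monitored process
    has been silent for longer than the network failure time limit.
    """
    if not last_heartbeat:
        return False  # nothing is being monitored, so nothing can be inferred
    return all(
        now - seen > network_failure_time_limit
        for seen in last_heartbeat.values()
    )


heartbeats = {"p1": 0.0, "p2": 1.0}
all_silent = network_suspected(heartbeats, now=12.0, network_failure_time_limit=10.0)
one_recent = network_suspected(heartbeats, now=10.5, network_failure_time_limit=10.0)
```

Under this reading, a single heartbeat arriving within the limit is enough to show that the network is still carrying traffic.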
The methods of the present invention include steps that may be performed by computer-executable instructions recorded on a computer-readable medium.
The present invention provides low-cost, simple techniques for detecting network and process failures in a distributed system. Accordingly, corrective action may be taken when failures are detected. Therefore, down-time for a service provided by the processes in the distributed system may be minimized. Those skilled in the art will appreciate these and other advantages and benefits of various embodiments of the invention upon reading the following detailed description of a preferred embodiment with reference to the below-listed drawings.