The present invention is generally related to monitoring computer processes in a distributed system. More particularly, the present invention is related to detecting process and network failures in a distributed system.
In recent years, reliable, high-performance computer systems have been, and still are, in great demand. Users have also demanded the introduction and propagation of multiprocessor distributed computer systems to support their computing processes (e.g., simulations, parallel processing, etc.). A distributed computer system generally includes a collection of processes and a collection of execution platforms (i.e., hosts). Each process may be capable of executing on a different host, and collectively, the processes function to provide a computer service. A failure of a critical process in a distributed system may result in the service halting. Therefore, techniques have been implemented for detecting a failure of a process in a timely manner, such that an appropriate action can be taken.
A conventional technique for detecting failure of a process includes the use of heartbeats, which are messages sent between processes at regular intervals of time. According to the heartbeat technique, if a process does not receive a heartbeat from a remote process prior to the expiration of a predetermined length of time, i.e., the heartbeat timeout, the remote process is suspected to have failed. Corrective action, such as eliminating the suspected process, may thus be taken.
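The conventional heartbeat technique described above can be sketched as follows. This is only an illustrative sketch, not an implementation taken from any particular system: the class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are hypothetical names chosen for the example.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: tracks when each remote process was last heard from."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # the heartbeat timeout, in seconds
        self.clock = clock           # injectable clock (an assumption, eases testing)
        self.last_seen = {}          # process id -> time of last heartbeat

    def watch(self, process_id):
        # Start the timeout window at the moment monitoring begins.
        self.last_seen[process_id] = self.clock()

    def record_heartbeat(self, process_id):
        # Called whenever a heartbeat message arrives from process_id.
        self.last_seen[process_id] = self.clock()

    def suspected(self):
        # A process is suspected once no heartbeat has arrived within the timeout.
        now = self.clock()
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A caller would typically invoke `record_heartbeat` from its message-receiving loop and poll `suspected()` periodically. Note that this sketch cannot tell a failed process from a failed network, which is the limitation discussed next.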
Failure to receive a heartbeat from a remote process, however, does not necessarily indicate a failure in the remote process. Commonly, a process is connected to a remote process through multiple independent networks on which the heartbeats may be transmitted. A missing heartbeat may therefore be attributable to a failure of one of the networks connecting the processes, rather than to a failure of the remote process. For example, a network failure may include a network pause (i.e., a temporary condition that prevents communication on a network) or a longer-lasting network failure, such as a failure of hardware that facilitates transmission on the network. A network pause, for example, can result from heavy, high-priority traffic over a network link, sometimes caused by other processes (e.g., remote machine backups). If the network pause lasts longer than the heartbeat timeout, or if a network failure occurs, each process waiting for a heartbeat transmitted over the network in the distributed system may suspect the other processes of failing. Unnecessary corrective actions, such as eliminating and/or replacing the suspected process, may then be taken, which can cause each service facilitated by the processes in the distributed system to be unnecessarily and temporarily halted. Further, if network failures could be detected, appropriate corrective action could be taken, such as establishing connections between the distributed system processes using alternative paths.
An aspect of the present invention is to provide a system and method for detecting process and network failures in a distributed system.
In one respect, the present invention includes a system and method for detecting a network failure in a distributed system. A first process in the distributed system is connected to at least one second process in the distributed system via multiple independent networks. If the difference between the period of time for the first process to receive a heartbeat from the second process on a first network and the period of time for the first process to receive a heartbeat from the second process on a second network exceeds a network failure threshold, the second network is suspected of failing.
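The network failure test can be illustrated with a short sketch. It rests on one stated assumption: the "period of time to receive a heartbeat" on a network is taken to be the time elapsed since the most recent heartbeat from the second process arrived on that network. The function name, its parameters, and the network labels are hypothetical.

```python
def suspected_networks(last_heartbeat_at, now, failure_threshold_s):
    """Return the networks suspected of failing, for one remote process.

    last_heartbeat_at maps a network name to the time at which the most
    recent heartbeat from the second process arrived on that network.
    """
    if not last_heartbeat_at:
        return []
    # Time elapsed on each network since the last heartbeat was received.
    elapsed = {net: now - seen for net, seen in last_heartbeat_at.items()}
    # The network with the most recent heartbeat serves as the reference.
    freshest = min(elapsed.values())
    # A network lagging the reference by more than the threshold is suspected.
    return sorted(
        net for net, gap in elapsed.items()
        if gap - freshest > failure_threshold_s
    )
```

For instance, if the last heartbeat arrived 1 second ago on one network and 10 seconds ago on another, the difference of 9 seconds exceeds a 5-second threshold, so only the second network is suspected and the remote process is not.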
In another respect, the present invention includes a system and method for detecting a process failure in the distributed system. If the first process fails to receive a heartbeat from the second process on any of the multiple independent networks, the second process is suspected of failing.
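A process failure test of this kind might look like the sketch below. It assumes that "fails to receive a heartbeat" means no heartbeat has arrived within a heartbeat timeout on every one of the networks; the function name and parameters are hypothetical.

```python
def process_suspected(last_heartbeat_at, now, heartbeat_timeout_s):
    """Return True if the second process is suspected of failing.

    last_heartbeat_at maps a network name to the time at which the most
    recent heartbeat from the second process arrived on that network.
    """
    if not last_heartbeat_at:
        # Assumption: with no networks to observe, no judgment is made.
        return False
    # Suspect the process only when every network has been silent for
    # longer than the heartbeat timeout.
    return all(
        now - seen > heartbeat_timeout_s
        for seen in last_heartbeat_at.values()
    )
```

A single network that still delivers heartbeats is enough to clear the process of suspicion, in which case the silence on the other networks points to a network problem instead.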
The methods of the present invention include steps that may be performed by computer-executable instructions stored on a computer-readable medium.
The present invention provides low-cost, simple techniques for detecting network and process failures in a distributed system. Accordingly, corrective action may be taken when failures are detected, and down-time for a service provided by the processes in the distributed system may be minimized. Those skilled in the art will appreciate these and other advantages and benefits of various embodiments of the invention upon reading the following detailed description of a preferred embodiment with reference to the below-listed drawings.