Reliable, high-performance computer systems have long been, and remain, in great demand. Users have also driven the introduction and spread of multi-processor distributed computer systems to support computing tasks such as simulations and parallel processing. A distributed computer system generally includes a collection of processes and a collection of execution platforms (i.e., hosts). Each process may be capable of executing on a different host, and collectively, the processes function to provide a computer service. A failure of a critical process in a distributed system may result in the service halting.
Typically, each process in a distributed system maintains information, which may be updated, regarding the configuration of the system as a whole. To this end, processes often maintain a “view”, which is a data structure representing the membership of the distributed system (i.e., the set of processes that constitute the system, each process in the view being a member). Each process is often required to maintain a view consistent with the views maintained by the other processes in the system. The processes in the system monitor the health of one another, for example, by sending heartbeats over network or internal communication links. This monitoring allows each process to update its view of which processes in the system are operational. Additionally, the processes in the system may communicate with each other to maintain consistent views of the system.
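As a minimal illustrative sketch of the view and heartbeat monitoring described above, the following Python fragment models a view as a set of member names and marks a member as suspected when no heartbeat has arrived within a timeout. The names `View`, `HeartbeatMonitor`, `timeout`, and the member labels `p1` to `p3` are hypothetical and are not taken from any particular system; timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class View:
    """Membership of the distributed system as seen by one process."""
    view_id: int
    members: frozenset


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: remembers when each peer was last heard from."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, member, now):
        # Called whenever a heartbeat from `member` is received.
        self.last_seen[member] = now

    def suspected(self, view, now):
        # A member is suspected if its last heartbeat is older than the
        # timeout, or if no heartbeat was ever received from it.
        return {m for m in view.members
                if now - self.last_seen.get(m, float("-inf")) > self.timeout}


view = View(view_id=1, members=frozenset({"p1", "p2", "p3"}))
monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("p1", now=10.0)
monitor.record_heartbeat("p2", now=10.0)
monitor.record_heartbeat("p3", now=6.0)
suspects = monitor.suspected(view, now=12.0)  # only "p3" is stale here
```

In this sketch a suspicion is only a local observation; the process would still need to request a view change before the membership is altered.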
A view change is an update to the represented membership that reflects the addition of a new member (i.e., a new process) or the removal of a current member from the view. For example, if a first process suspects that another process has failed, the first process can request a view change. A common technique for maintaining consistency between the processes includes having the processes in the system vote on a view change. A majority of the members of the system must vote for a new view for it to be adopted.
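The majority rule can be sketched as follows. This is an illustrative simplification that assumes votes are counted against the members of the current view and that a strict majority (more than half) is required; the function name `adopt_view` and the member labels are hypothetical.

```python
def adopt_view(current_members, proposed_members, votes):
    """Return the proposed membership if a strict majority of the current
    members voted for it; otherwise keep the current membership.

    `votes` maps a member name to True (in favor) or False (against).
    Members that did not vote are counted as not in favor.
    """
    in_favor = sum(1 for m in current_members if votes.get(m, False))
    if in_favor > len(current_members) // 2:
        return frozenset(proposed_members)
    return frozenset(current_members)


members = frozenset({"p1", "p2", "p3"})
proposed = members - {"p2"}  # p1 suspects p2 and proposes removing it

adopted = adopt_view(members, proposed, {"p1": True, "p3": True})
rejected = adopt_view(members, proposed, {"p1": True})
```

With three members, two votes in favor are enough to adopt the new view, while a single vote leaves the view unchanged.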
A simple example of a distributed system is discussed herein to illustrate these concepts. One technique for minimizing the risk of a failure of a critical process in a distributed system includes implementing a fault tolerant system. A simple fault tolerant system may include three processes. Two of the processes are mirrors that provide system redundancy, which makes the system fault tolerant. If one mirror fails, the other has the ability to perform the role of the failed mirror in the system. A third process (called the “witness”) is not a mirror and acts as a tie-breaker for view-consensus algorithms. If the third process agrees to a view change presented by one of the mirrors, the view change may be adopted by the system.
A distributed system may include members that have specific “roles”. In the above example, two members may be mirrors of each other and provide users with a service, and the third witness-member may perform the function of maintaining view consistency.
While fault tolerant systems typically replace members due to the failure of a process, it may be desirable to replace a member that has not failed; such a member is hereafter referred to as the “victim”. A victim may be replaced in situations such as the following: when the system is reconfigured to optimize performance; when a failed communication path requires a new process to be added via another communication path; when a host is removed from service for maintenance purposes; and the like.
A process may be replaced by killing the victim, which causes the system to repair itself by automatically replacing the victim. FIG. 5 illustrates a conventional method for killing a victim and merely relying on the system to replace the lost process through conventional failure detection and auto-repair ability.
In FIG. 5, view 505 illustrates normal operation of a three member system 500, including members (e.g., processes) 510, 520 and 530. Members 520 and 530 are mirrors. View 540 illustrates a first view change, including the removal of member 530. In view 540, member 530 is terminated (i.e., the victim is killed) and then removed from the view. In view 540, the system 500 is no longer fault tolerant, because member 520 no longer has a mirror. Therefore, if member 520 fails, the computer service provided by system 500 may be halted. View 560 illustrates a second view change. In view 560, a new member 550 is added to the system to replace member 530.
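The sequence of views in FIG. 5 can be traced with a short sketch. It assumes, for illustration only, that the system is fault tolerant whenever at least two mirrors are present in the view; the function names are hypothetical, and the integers are simply the reference numerals of FIG. 5 used as member labels.

```python
def is_fault_tolerant(mirrors):
    # Assumption for this sketch: redundancy requires at least two mirrors.
    return len(mirrors) >= 2


def kill_and_repair(mirrors, victim, replacement):
    """Trace the conventional two-step replacement: the victim is removed
    in one view change and the replacement is added in a second one.

    Returns a list of (label, sorted mirrors, fault tolerant?) tuples.
    """
    initial = frozenset(mirrors)
    after_kill = initial - {victim}
    after_repair = after_kill | {replacement}
    steps = [
        ("normal operation", initial),
        ("victim removed", after_kill),
        ("replacement added", after_repair),
    ]
    return [(label, sorted(ms), is_fault_tolerant(ms)) for label, ms in steps]


trace = kill_and_repair(mirrors={520, 530}, victim=530, replacement=550)
for label, mirrors, tolerant in trace:
    print(f"{label}: mirrors={mirrors}, fault tolerant={tolerant}")
```

The middle entry of the trace corresponds to view 540, in which only member 520 remains as a mirror and the system is exposed to a single failure.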
The process illustrated in FIG. 5 is “non-atomic”, because more than one view change is required to replace a process. Because the system requires all three members in order to remain tolerant of failures, system 500 is not fault tolerant during the interval after the victim has been killed and before the victim has been replaced. This temporarily exposes system 500 to the risk of system failure and a halt in service. To avoid this risk, it is desirable to create a replacement member before killing the victim.