Distributed computer systems rival and even surpass the processing capabilities of supercomputers which represented the state of the art just a few years ago. Distributed computer systems achieve such processing capacity by dividing tasks into smaller components and distributing those components to member computers of the distributed computer system, each of which processes a respective component of the task while other member computers simultaneously process other components of the task. Larger distributed computer systems promise ever increasing processing capacity at ever decreasing cost.
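The division of a task into components that are processed simultaneously can be illustrated with a minimal sketch. The sketch is illustrative only and makes several assumptions not found in the description above: the function names `split_task`, `process_component`, and `run_distributed` are hypothetical, the task (a sum of squares) is arbitrary, and worker threads within a single computer stand in for the member computers of a distributed computer system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(items, parts):
    """Divide a list of work items into at most `parts` contiguous components."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_component(component):
    """Work performed by one (simulated) member computer: a sum of squares."""
    return sum(x * x for x in component)


def run_distributed(items, node_count):
    """Split the task, process the components concurrently, combine the results."""
    components = split_task(items, node_count)
    # Assumption: each worker thread stands in for one member computer.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_results = list(pool.map(process_component, components))
    return sum(partial_results)
```

For example, `run_distributed(list(range(10)), 3)` splits ten work items into three components and combines the three partial sums into a single result.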
While distributed computer systems provide excellent processing capacity, such systems are particularly susceptible to computer hardware and software failures. Distributed computer systems have multiple computers with multiple, redundant components such as processors, memory and storage devices, and system software, and further include communications media connecting the multiple member computers of the distributed computer system. Failure of any of the many constituent components of the distributed computer system can result in unavailability of the distributed computer system. Accordingly, a very important attribute of any distributed computer system is the ability of the system to tolerate individual or multiple, simultaneous faults. Such fault tolerance makes a distributed computer system more reliable than most single computers. Specifically, failure of a substantial portion of the distributed computer system is tolerated, and processing by the distributed computer system continues, albeit at diminished capacity.
In general, distributed computer systems must meet a number of criteria to properly tolerate faults and to function adequately. First, all constituent computers of the distributed computer system, which are sometimes referred to as "nodes," must agree regarding which of the nodes are members of a cluster. A cluster is generally a number of nodes of a distributed computer system which collectively cooperate to perform distributed processing. If nodes of a distributed computer system disagree as to the membership of the cluster, nodes can also disagree as to which nodes have a quorum and therefore have access to shared resources and data. In that case, the likelihood of simultaneous, inconsistent access to the shared resources and data, and therefore of corruption of the data, is great. Second, no single-point failure within a cluster can result in complete unavailability of the cluster. Such susceptibility to failure is generally unacceptable. Third, nodes of a cluster which has a quorum must never be in disagreement regarding the state of the cluster. A cluster which has a quorum has exclusive access to resources which the nodes of the cluster would otherwise share with other nodes of the distributed computer system. And fourth, isolated or faulty nodes of a cluster must be removed from the cluster in a finite period of time, e.g., one minute.
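The relationship between cluster membership and quorum can be illustrated with a minimal sketch. The description above does not define how a quorum is determined; the sketch assumes one common rule, namely that a group of nodes has a quorum when it contains a strict majority of all configured nodes. The function names `has_quorum` and `partitions_with_quorum` are hypothetical. Under this assumed rule, two disjoint groups of nodes can never both have a quorum, so at most one group obtains access to the shared resources and data.

```python
def has_quorum(partition, all_nodes):
    """Return True if `partition` holds a strict majority of `all_nodes`.

    Assumption: a quorum is more than half of the configured nodes.
    """
    all_nodes = set(all_nodes)
    members = set(partition) & all_nodes  # ignore nodes that are not configured
    return 2 * len(members) > len(all_nodes)


def partitions_with_quorum(partitions, all_nodes):
    """Return the partitions that may access shared resources (at most one
    when the partitions are disjoint)."""
    return [p for p in partitions if has_quorum(p, all_nodes)]


if __name__ == "__main__":
    nodes = {"A", "B", "C", "D", "E"}
    # A communications failure splits the five nodes into two groups.
    groups = [{"A", "B", "C"}, {"D", "E"}]
    for group in groups:
        print(sorted(group), "has quorum:", has_quorum(group, nodes))
```

With an even split, e.g., four configured nodes divided into two groups of two, neither group has a quorum under this rule, which is one reason a strict majority is assumed here rather than half of the nodes.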
Some currently available distributed computer systems can tolerate at most one failure of any node or communications link of the system at one time and can tolerate consecutive failures of every node but one. The ability to tolerate multiple, simultaneous failures in a distributed computer system greatly improves the reliability of such a distributed computer system.