This disclosure relates generally to the field of computer synchronization, and more particularly to continuing operation of a quorum-based system after failures.
A distributed computing system uses software to coordinate tasks that are performed on multiple computers simultaneously. The computers interact over a network and send each other messages to confirm that they can still communicate. When the network becomes partitioned, or when a computer fails, some or all of the computers can no longer communicate and become isolated. Processes running in the different network partitions may continue to operate separately, thereby producing inconsistent results.
The configuration of a distributed computing system includes the number of processes, replicas, or computers that must maintain communications, referred to as a quorum. The quorum represents the number of participants that must agree on a result before it can be applied to all the participants. This ensures that only a network partition that has a quorum is able to continue processing requests. Processes in the remaining partitions cannot form a quorum and thus cannot continue processing requests. If no partition has a quorum, the distributed system becomes non-operational until quorum is restored. However, a prompt restart of failed processes is not always possible, especially if the processes failed due to a hardware problem. Thus, a quorum-based system may be non-operational for an extended period of time.
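The quorum rule described above can be illustrated with a minimal sketch. It assumes a strict-majority policy, which is a common choice but is not specified in this disclosure. The names quorum_size and has_quorum are hypothetical and introduced only for illustration.

```python
def quorum_size(total_participants: int) -> int:
    """Smallest strict majority of the configured participants.

    Assumption: the quorum is a strict majority; other policies are possible.
    """
    return total_participants // 2 + 1


def has_quorum(reachable: int, total_participants: int) -> bool:
    """Return True if a partition with `reachable` members may keep processing requests."""
    return reachable >= quorum_size(total_participants)


# A five-member system split by a network partition into groups of three and two.
partitions = [3, 2]
status = [has_quorum(members, 5) for members in partitions]
```

Under this assumed policy, at most one partition can hold a quorum at any time, which is what prevents two isolated groups from applying conflicting results. If failures leave every group below the threshold, no partition qualifies and the system stops processing requests.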
Allowing a quorum-based system to remain operational while quorum is being recovered may enhance availability.