Network services, server systems, cloud services, and the like provide computing services to clients. For availability and performance, such a service is often constructed with multiple servers or machines. The various machines may cooperate to maintain a state that is consistent with respect to the clients or applications that access the service. To this end, servers may each maintain a copy or replica of the state of the service, and updates by one server are applied or performed by the other servers. At some level, each machine is considered equivalent. That is, if each machine is running the same service software and maintaining the same state, each will provide the same output for the same input. This well-known type of system is often referred to as a replicated state machine (RSM). With an RSM, the functionality of a single server is duplicated among a set of N replicas.
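The equivalence property described above may be illustrated with a minimal sketch. The key-value service, the class name KeyValueStateMachine, and the helper run_replicas are hypothetical and are used only for illustration; any deterministic service would behave the same way. Each replica starts from the same state and applies the same sequence of operations, so each produces the same outputs.

```python
class KeyValueStateMachine:
    """Hypothetical deterministic service: the output of each operation
    depends only on the current state and the operation itself."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.state[key] = rest[0]
            return "ok"
        if kind == "get":
            return self.state.get(key)
        raise ValueError(f"unknown operation: {kind}")


def run_replicas(n, ops):
    """Apply the same operation sequence to n independent replicas and
    return the list of outputs produced by each replica."""
    replicas = [KeyValueStateMachine() for _ in range(n)]
    return [[replica.apply(op) for op in ops] for replica in replicas]


ops = [("put", "x", 1), ("get", "x"), ("get", "y")]
outputs = run_replicas(3, ops)
# Every replica returns the same outputs for the same inputs.
identical = all(out == outputs[0] for out in outputs)
```

If the service consulted a random number generator or a local clock inside apply, the replicas could diverge, which is why determinism is required of the service.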
FIG. 1 shows a replication subsystem 100. The RSM in FIG. 1 includes three replica machines 102 which together form the RSM. In practice, many replica machines 102 may be used. The RSM may be implemented using the replication subsystem 100 (i.e., a framework or substrate). For discussion, the RSM may be considered to be the replicated service in combination with the replication subsystem 100. The replication subsystem 100 may perform various replication related functions for the RSM. For example, the replication subsystem 100 may deal with fault tolerance and operation ordering on behalf of the RSM. That is, an RSM, if built in a deterministic manner, may interface with the replication subsystem 100, which may in turn guarantee fault tolerance and operation ordering properties of the RSM when certain properties of the RSM hold true. Regarding the deterministic nature of a service, note that the service may need to be written as a deterministic state machine (a state machine in the classic computer science sense, e.g., with no random or non-deterministic behavior), where distribution of functionality of the servers 104 is left to the replication subsystem 100. In other words, a developer may write a complex state machine and need only be concerned with assuring that the service is constructed to be deterministic; the replication subsystem 100 will be accessed to transparently handle distribution, fault tolerance, and the like.
The replication subsystem 100 may be used as follows. From the perspective of an application developer, the developer writes an ordinary but deterministic server 104 and client 106, where the client 106 sends messages such as operation requests to the server 104, via a network, and the server 104 performs operations and sends reply messages. The replication subsystem 100 operates as local components on the servers and clients, i.e., each server replication component 108 is collocated with a server 104 and each client replication component 110 is collocated with a client 106. The components 108, 110 have respective application programming interfaces (APIs) that the servers 104 and clients 106 use to access the replication subsystem. When running as an RSM, there are multiple instances of the server 104, each sending and receiving updates via the replication subsystem 100 (in particular, by a local replication component 108 of the replication subsystem). A server's local server replication component 108 may maintain a shadow copy of the application state of its server 104. The replication subsystem components 108, 110 cooperate to provide an RSM while preserving the semantics of multiple clients accessing a single server. When a client 106 accesses the RSM, the client's client replication component 110 will communicate with various of the server replication components 108 on behalf of the client 106. When a server 104 performs an operation that affects the state of the RSM, the server's server replication component 108 will coordinate with other server replication components 108 to replicate the operation. As used herein, depending on the context, the term “replication subsystem” may refer to a client replication component 110, a server replication component 108, or both.
To provide fault tolerance, the replication subsystem 100 may implement a consensus protocol. When a client 106 submits an operation to the server system, the replica machines 102 first communicate according to a consensus protocol (via the server replication components 108) to establish the order in which the operation will execute relative to other operations received by the server system or RSM. Then, according to this consensus, the replicated servers 104 each separately execute the operation and the server replication components 108 send the corresponding results to the client machine 112 of the requesting client 106 (specifically, to the client replication component 110 on the client machine 112, which then provides the result to the client 106). It has been proven that, if certain conditions hold, the RSM as implemented by the replication subsystem 100 may experience faults and yet produce results that are identical to those of a single correctly functioning server. In particular, the ordering of all operations from various clients 106 is well defined. That is to say, some level of fault tolerance may be guaranteed by the replication subsystem 100 under some specific conditions.
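The two-phase flow described above (first agree on an order, then execute separately on each replica) may be sketched as follows. This is an illustration only: the Sequencer class is a hypothetical stand-in for the outcome of a consensus protocol and is not itself fault tolerant, and the names Replica, deliver, and execute are assumptions introduced for this sketch. The operations are deliberately non-commutative so that the agreed order matters.

```python
import itertools


class Sequencer:
    """Stand-in for the consensus step: assigns each operation a unique,
    gap-free slot number. A real consensus protocol reaches this agreement
    among the replicas themselves and tolerates faults; this class does not."""

    def __init__(self):
        self._slots = itertools.count()

    def order(self, op):
        return next(self._slots), op


def execute(value, op):
    """Deterministic, non-commutative operations on an integer state."""
    kind, arg = op
    if kind == "add":
        return value + arg
    if kind == "mul":
        return value * arg
    raise ValueError(f"unknown operation: {kind}")


class Replica:
    def __init__(self):
        self.value = 0
        self.next_slot = 0
        self.pending = {}  # slot -> operation delivered but not yet executable

    def deliver(self, slot, op):
        """Buffer the operation, then execute every operation whose turn has come."""
        self.pending[slot] = op
        while self.next_slot in self.pending:
            self.value = execute(self.value, self.pending.pop(self.next_slot))
            self.next_slot += 1


sequencer = Sequencer()
ordered = [sequencer.order(op) for op in [("add", 3), ("mul", 2), ("add", 1)]]

a, b = Replica(), Replica()
for slot, op in ordered:            # replica a receives operations in order
    a.deliver(slot, op)
for slot, op in reversed(ordered):  # replica b receives them in reverse
    b.deliver(slot, op)
```

Both replicas end in the same state because each executes strictly in slot order, regardless of the order in which the network delivers the operations.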
RSM faults have been divided into two categories: stopping faults and Byzantine faults. By selecting a particular consensus protocol, an RSM can be configured to deal with a particular class of fault. A stopping fault occurs when a replica exits the RSM by loss of connectivity, machine failure, software failure, and so on. A consensus protocol able to handle such a fault is stopping fault tolerant (SFT). A Byzantine fault is a fault that occurs when a replica has failed in a way that renders its behavior incorrect. For example, a replica that has experienced Byzantine failure may produce random outputs, may continue to communicate via the consensus protocol in erroneous ways, may generate random messages, may stop, may act correctly, and so on. If the consensus protocol is not designed for Byzantine fault tolerance, Byzantine faults may cause replicas to become corrupt and clients may in turn become corrupt or fail. However, if a properly functioning Byzantine fault tolerant (BFT) consensus protocol is used, the state of the replicas that have not experienced Byzantine failure will remain consistent and correct, meaning they will advance in the way a single correct machine would advance, and the responses that all clients see will be correct. An example of an SFT consensus protocol was described in a paper titled “The SMART Way to Migrate Replicated Stateful Services” (Jacob R. Lorch, Atul Adya, William J. Bolosky, Ronnie Chaiken, John R. Douceur, and Jon Howell, in the Proceedings of EuroSys 2006). An example of a BFT consensus protocol was described in a paper titled “Practical Byzantine Fault Tolerance” (Miguel Castro and Barbara Liskov, in the Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI) '99).
In practice, a client replication component 110 may obtain messages from all replicas, but, if implementing a BFT consensus protocol, will know how to handle corrupt replies and will give the client a single-machine consistent view. For example, a BFT client replication component 110 may resolve conflicting replies from replicas by following a majority of equivalent replies. In sum, if a replica server 104 enters a Byzantine failure state and erroneous or nonsensical messages are received by a client machine 112, the client 106 application above the BFT replication subsystem 100 will see only sane and consistent messages.
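The majority rule mentioned above may be sketched as follows. The function name resolve_replies and its parameters are hypothetical, replies are assumed to be hashable values, and a strict majority of all replicas is only one possible acceptance threshold; a particular BFT protocol may specify a different one.

```python
from collections import Counter


def resolve_replies(replies, replica_count):
    """Return the reply reported by a strict majority of the replicas,
    or None if no reply has yet reached a majority.

    replies       -- replies received so far (hashable values)
    replica_count -- total number of replicas N in the RSM
    """
    if not replies:
        return None
    reply, votes = Counter(replies).most_common(1)[0]
    return reply if votes > replica_count // 2 else None


# One of four replicas has failed and returns a nonsensical reply.
accepted = resolve_replies(["42", "42", "garbage", "42"], replica_count=4)
# Only two conflicting replies have arrived so far: no majority yet.
undecided = resolve_replies(["a", "b"], replica_count=4)
```

In this sketch the client application would be handed only the accepted value, so the erroneous reply never becomes visible above the client replication component.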
As mentioned, an RSM may be guaranteed to tolerate faults under certain conditions (e.g., limited failure), and in practice this may involve the replication subsystem 100 implementing a consensus protocol. The consensus protocol conditionally guarantees operation ordering, meaning that if a first client 106 submits operation1 and a second client 106 submits operation2, and assuming that the fault tolerance conditions are true, either operation1 will be applied to the RSM before operation2, or operation2 will be applied to the RSM before operation1. In either case, there is a strict order in which the operations are applied. This type of ordering property is sometimes referred to as classic ordering, strict ordering, or strong ordering. It has been proven that an RSM configured with an SFT consensus protocol can guarantee ordering and tolerate stopping faults only under the condition that the count of replicas N is greater than or equal to 2F+1, where F is the count of faults. It has also been proven that an RSM configured to tolerate Byzantine faults can guarantee ordering only under the condition that N is greater than or equal to 3F+1.
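The replica-count conditions stated above can be made concrete with a short calculation. The function names below are hypothetical; the arithmetic follows directly from N ≥ 2F+1 for stopping faults and N ≥ 3F+1 for Byzantine faults.

```python
def min_replicas(faults, byzantine=False):
    """Smallest N that satisfies N >= 2F+1 (stopping) or N >= 3F+1 (Byzantine)."""
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return (3 if byzantine else 2) * faults + 1


def max_faults(replicas, byzantine=False):
    """Largest F for which the given N still satisfies the bound."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return (replicas - 1) // (3 if byzantine else 2)


# A three-replica RSM tolerates one stopping fault but no Byzantine fault;
# a fourth replica is needed to tolerate one Byzantine fault.
sft_three = max_faults(3)
bft_three = max_faults(3, byzantine=True)
bft_four = max_faults(4, byzantine=True)
```

Under these bounds, tolerating a given number of Byzantine faults requires more replicas than tolerating the same number of stopping faults.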
While SFT and BFT strong ordering guarantees are useful, a previous BFT consensus protocol added unordered or so-called weak operations. That is, this previous consensus protocol provided two types of operation: strong and weak. A strong operation, when submitted by a client, is a classic BFT well-ordered operation as described in the preceding paragraph, meaning that when a client submits a strong operation, if enough replica machines are available, the operation completes, strict ordering is guaranteed relative to other strong operations, and the client gets an answer to that effect. If not enough machines are available, the operation fails and is not performed by the RSM. A weak operation, by contrast, may guarantee replication (i.e., the operation will be applied to the RSM) but does not guarantee ordering with respect to other weak operations or with respect to strong operations. Weak operations are more tolerant of faults. That is, a weak operation needs fewer machines than a strong operation needs to form a sufficient consensus. In practice, however, perhaps due to the lack of any ordering guarantees, weak operations have limited practical application. Furthermore, while the strict ordering guarantees of classic BFT and SFT consensus protocols have been established by highly complex and rigorous mathematical proofs, it is possible that such guarantees have not been conclusively proven for the strong-and-weak approach.
Embodiments described herein relate to fault tolerant consensus protocols and implementation thereof.