The present disclosure pertains generally to distributed computing systems and, more particularly, to distributed (replicated) data store systems comprising strongly consistent data store replicas. More specifically, the present disclosure relates to a fault-tolerant data processing computer system and method for implementing a distributed (replicated) two-tier state machine, in which consistency among processes (devices) is maintained despite the failure of any number of processes (devices) and communication paths. The two-tier state machine can be used to build a reliable distributed (replicated) data store system and also other distributed computing systems with modest reliability requirements that do not justify the expense of an extremely fault-tolerant, real-time implementation.
From the computer architecture point of view, a distributed data store system is middleware that application programmers can use to develop any kind of distributed application. It generally consists of a set of computers, each equipped with a local data store, primitive operations for reading and writing to the local data store, and a protocol for ensuring synchronization among the computers that is tailored to the desired functionality (e.g. all or only some data are replicated). A data store is a repository of a set of data objects. These objects are modeled using classes defined in a database schema. A data store is a general concept that includes not just repositories like databases, but also simpler store types such as key-value data repositories, flat files, etc. The programmers can use the primitive operations for reading and writing to the local data store to implement transactions, i.e. blocks of code that operate on the data store with the desired safety properties. A distributed data store system facilitates development of distributed applications, since the programmer only has to implement the application handlers that handle client requests and the application transactions that operate on the store (as required by the requests). Applications can be modified without redesigning the underlying middleware. Moreover, if the underlying data store system can tolerate failures, it is also much easier to develop robust applications. In particular, a fully replicated data store system can continue to provide service even if some of its replicas have crashed and have not yet been recovered.
In a system comprising a distributed data store and a client application, there are a number of server computers (servers) connected together in a network in which the servers can send messages to each other. Each server has access to a local data store kept in stable storage that can survive server crashes. On every server, there are many concurrent processes processing client requests and returning responses to the clients. Processing a client request means translating the request into a transaction that executes some code and returns a result to the client. To increase system robustness and availability, a local data store can be replicated, that is, every local store (replica) contains an exact copy of the data. A client then gets the same response no matter which server processes the request. In particular, if a given server is down or slow and does not respond, a client can resubmit its request to another server. In practice, a crashed server can be recovered, meaning that the server is restarted and its state is brought up to date with the current state of the other servers.
Conventional approaches to implementing fault-tolerant distributed data store systems require some synchronization protocols for maintaining consistency among replicas. However, the synchronization protocols designed in accordance with the prior art have several drawbacks, as explained below.
The two-phase commit protocol (2PC) (described in: Jim Gray. Notes on data base operating systems. In Operating Systems: An Advanced Course, volume 60 of Lecture Notes in Computer Science, pages 393-481, Berlin, Heidelberg, New York, 1978. Springer-Verlag.), a popular consensus protocol known from distributed database systems, generally assumes a single process (a leader) that coordinates all processes. In the first phase, the leader attempts to prepare all the processes to take the necessary steps for either aborting or committing a transaction and to vote for a commit or abort. In the second phase, the leader decides to commit the transaction (if all processes have voted for commit) or abort (otherwise). The protocol is not resilient to all possible failure configurations and it is a blocking protocol. After a process has sent its vote to the leader, it blocks until a commit or abort decision is received. If the leader fails permanently, some processes will never resolve their decisions. If both the leader and some process fail, it is possible that the failed process accepted a decision while other processes did not. Even if a new leader is elected, it cannot proceed with the operation until it has received a message from all processes and hence it must block.
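The two phases described above can be illustrated with a minimal, failure-free sketch. All names here (`Participant`, `will_commit`, `two_phase_commit`) are hypothetical and introduced only for illustration; the sketch models neither message passing nor crashes, so it shows the decision rule only, not the blocking behavior.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """Hypothetical participant; `will_commit` stands in for local prepare logic."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.decision = None  # stays unresolved until the leader's decision arrives

    def prepare(self):
        # Phase 1: vote. In a real deployment the participant now waits
        # (blocks) for the leader's decision.
        return Vote.COMMIT if self.will_commit else Vote.ABORT

    def decide(self, decision):
        # Phase 2: adopt the decision sent by the leader.
        self.decision = decision


def two_phase_commit(participants):
    """Leader logic: commit only if every participant voted to commit."""
    votes = [p.prepare() for p in participants]
    unanimous = all(v is Vote.COMMIT for v in votes)
    decision = Vote.COMMIT if unanimous else Vote.ABORT
    for p in participants:
        p.decide(decision)
    return decision
```

If the leader stopped after collecting the votes and before calling `decide`, every participant would be left with `decision` still set to `None`, which corresponds to the blocking situation described above.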
The three-phase commit protocol (3PC) (described in: Dale Skeen and Michael Stonebraker. A formal model of crash recovery in a distributed system. IEEE Transactions on Software Engineering, SE-9(3):219-228, May 1983.) is more resilient to faults than the 2PC protocol. It avoids permanent blocking by introducing an additional phase, in which the leader sends a preCommit message to the other processes. The leader will not send out a decision message (abort or commit) until all processes have acknowledged the preCommit message. The protocol places an upper bound on the amount of time required before a transaction either commits or aborts. This property ensures that if a given transaction is attempting to commit via 3PC and holds some resource locks, it will release the locks after the timeout. Thus, the protocol can make progress in case of failures. However, the original 3PC protocol does not take into account every possible mode of failure. In particular, it is only resistant to node crashes and is vulnerable to, e.g., network partitions. A network partition is a failure of a network device that causes the network to be split, so that some processes are not able to communicate with each other.
The enhanced three-phase commit protocol (E3PC) (described in: Idit Keidar and Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems. Journal of Computer and System Sciences, 57(3), 309-324, December 1998) alleviates the aforementioned shortcomings of 3PC by introducing a quorum-based recovery phase. However, even though processes are not blocked indefinitely by a failure of some process or a network partition, a transaction's commitment may be significantly delayed. This is because, as in the 2PC and 3PC protocols, a transaction can only commit when all processes accept it. If failures occur, processes may invoke the recovery procedure and elect a new coordinator. If the recovery procedure fails (e.g., due to the crash of some process), it is retried until it eventually succeeds. The final decision on whether to commit or abort a transaction can only be made when the system is fully recovered.
A state machine approach (described in: Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM (CACM), 21(7):558-565, July 1978.) is another popular method that can be used to implement distributed data stores and other computing systems that must tolerate failures. A state machine generally consists of a set of states, a set of commands, a set of responses, and a mapping that assigns a response/state pair to each command/state pair. A state machine executes a command by changing its state and producing a response, with the command and the machine's current state determining its new state and its response. A state machine can be replicated, as illustrated in FIG. 1. Then, all state machines start from the same state S0 and execute exactly the same sequence of commands c1 . . . ck+1 (k>0). A distributed computing system consists of several component processes (devices) that are connected by a network. In the distributed state machine approach to building fault-tolerant systems, the component processes (devices) are replicated and synchronized by having every process P1 . . . Pn independently simulate the execution of the same state machine. The state machine is tailored to the particular application, and is implemented by a general algorithm for simulating an arbitrary distributed (replicated) state machine. Problems of synchronization and fault tolerance are handled by this algorithm. When a new system is designed, only the state machine is new.
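As a minimal sketch of this definition, the following code models the assignment of a response/state pair to each command/state pair as a deterministic function and replays one command sequence on several replicas. The integer state, the "add" and "mul" commands, and the names `transition`, `Replica` and `run_replicas` are assumptions made for illustration only.

```python
def transition(command, state):
    """Hypothetical deterministic mapping: (command, state) -> (response, new state)."""
    op, operand = command
    if op == "add":
        new_state = state + operand
    elif op == "mul":
        new_state = state * operand
    else:
        raise ValueError(f"unknown command: {op}")
    # In this toy machine the response is simply the new state.
    return new_state, new_state


class Replica:
    """One process simulating the state machine from a given initial state."""

    def __init__(self, initial_state):
        self.state = initial_state

    def execute(self, command):
        response, self.state = transition(command, self.state)
        return response


def run_replicas(n, initial_state, commands):
    """Start n replicas from the same state and feed each the same command sequence."""
    replicas = [Replica(initial_state) for _ in range(n)]
    responses = [[r.execute(c) for c in commands] for r in replicas]
    return replicas, responses
```

Because `transition` is deterministic and every replica receives the commands in the same order, all replicas end in the same state and produce the same responses. Getting the same sequence to every replica despite failures is the job of the general algorithm mentioned above and is not modeled here.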
If additional assumptions are made about the relation between state machine commands, an algorithm implementing a distributed (replicated) state machine can be designed to reflect that relation and to improve performance. For example, commands that have a commutative relationship can be executed in an arbitrary order, so a state machine need not require that all processes obtain all commands in the same order. Consider a distributed computing system for maintaining bank accounts of customers. Some actions of different clients can be translated into state machine commands that commute with one another. For instance, if a client c1 issued a request to deposit $100 into its account at approximately the same time as a client c2 issued a request to withdraw $50 from its account, either command could be performed first, without affecting the final state of the distributed state machine. A method and system for implementing a fault-tolerant distributed state machine that supports commutative commands were described in the European patent EP1659500. However, the approach presented in EP1659500 is not much different from the original state machine, since even though different processes (devices) may obtain the same commands in a different order, the commands still have to be executed sequentially. Moreover, it lacks a general method by which the distributed state machine can decide whether two commands are commutative or not.
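The bank account example can be made concrete with a small sketch that applies two commands in both orders and compares the resulting states. The command format, the rule that an overdrawing withdrawal is rejected, and the names `apply_command` and `commute_in` are assumptions for illustration; the check covers one starting state only and is not a general commutativity decision procedure.

```python
def apply_command(balances, command):
    """Return a new balances dict after applying (op, account, amount)."""
    op, account, amount = command
    new_balances = dict(balances)
    if op == "deposit":
        new_balances[account] += amount
    elif op == "withdraw":
        # Assumed rule: a withdrawal that would overdraw the account is rejected.
        if new_balances[account] >= amount:
            new_balances[account] -= amount
    else:
        raise ValueError(f"unknown command: {op}")
    return new_balances


def commute_in(balances, a, b):
    """True if a and b reach the same final state in either order from `balances`."""
    a_then_b = apply_command(apply_command(balances, a), b)
    b_then_a = apply_command(apply_command(balances, b), a)
    return a_then_b == b_then_a
```

With separate accounts, the deposit by c1 and the withdrawal by c2 commute. On a single account that starts empty, a deposit of 100 and a withdrawal of 50 do not: the withdrawal succeeds only if it runs second. This illustrates why deciding commutativity in general is not trivial.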
Paxos (originally described in: Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2), May 1998, and in the U.S. Pat. No. 5,261,085 under the name of Multiple Command Protocol) is the most popular algorithm for implementing arbitrary state machines. It has been used successfully in many practical distributed applications. The general idea of the Paxos protocol can be explained as follows. The state machine commands are chosen through a series of numbered ballots, where each ballot is a referendum on a single command. The state machine commands are numbered consecutively. One of the processes (devices) in the network is designated as a leader, and it sends ballots with proposed commands to the other processes (devices). In each ballot, a process has the choice of either voting for the proposed command or not voting. A process does not vote if it has already voted in a higher ballot. A crashed process also does not vote. In order for a ballot to succeed and a command to be issued, a majority set of the processes in the system must vote for it. If fewer than a majority of processes voted for a command, then another ballot has to be conducted. Therefore, a single command can be voted on in several ballots. Each ballot is given a unique number, and the majority set is chosen in such a manner that the majority sets voting on any two ballots will have at least one process in common (in fact, any two majority sets have at least one process in common). Thus, any command which has been issued will appear in the store of at least one process of any majority set participating in a subsequent ballot. Each issued command is processed by a separate instance (execution) of the protocol. Protocol instances (executions) and issued commands are numbered using natural numbers. An instance n denotes the n'th instance (execution) of the protocol which corresponds to the issued command number n.
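The majority-set property that the ballots rely on can be checked by brute force for a small system. The sketch below is not an implementation of Paxos; it only enumerates majority sets and tests whether a set of voters forms a majority. The process names and the functions `majority_sets`, `all_majorities_intersect` and `ballot_succeeds` are hypothetical.

```python
from itertools import combinations


def majority_sets(processes):
    """Enumerate every subset containing more than half of the processes."""
    need = len(processes) // 2 + 1
    return [
        set(subset)
        for size in range(need, len(processes) + 1)
        for subset in combinations(processes, size)
    ]


def all_majorities_intersect(processes):
    """Check that any two majority sets share at least one process."""
    sets_ = majority_sets(processes)
    return all(a & b for a, b in combinations(sets_, 2))


def ballot_succeeds(voters, processes):
    """A ballot succeeds when the voters form a majority of the processes."""
    return len(set(voters) & set(processes)) > len(processes) // 2
```

For five processes a ballot needs at least three votes, and any two groups of three out of five must overlap, so a process that voted for an issued command is present in every later majority.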
When a new leader is chosen, messages are exchanged between the new leader and the other processes in the system to ensure that each of the processes has all of the commands that the other processes have. As part of this procedure, any command for which one of the processes has previously voted but does not have a command number is broadcast as a proposed command in a new ballot. The protocol allows a leader to conduct any number of ballots concurrently by running a separate instance of the protocol for each command number.
In the simplest state machine approach, a distributed data store system is implemented with a network of servers that transform transactions into commands of a distributed state machine. Any algorithm used for simulating a distributed state machine ensures that all servers obtain the same sequence of commands to be executed sequentially, thereby ensuring that they all produce the same sequence of state changes, assuming they all start from the same initial state and the state machine is deterministic (i.e., given the same input it produces the same output). Therefore strong consistency is ensured and network communication is modest (since only commands have to be broadcast). However, in general, transactions cannot be executed concurrently on a server (since they must produce the same results on all servers), which prevents the system from fully utilizing modern multi-core architectures.
In the database state machine approach to building a distributed store system (described in: Fernando Pedone, Rachid Guerraoui, and André Schiper. The database state machine approach. Distributed and Parallel Databases, 14(1):71-98, July 2003), a distributed state machine is only used for transaction commitment. In a distributed (replicated) data store built using this approach, transactions can be executed concurrently, but a transaction commitment procedure is transformed into a state machine command. The command performs two tasks: (1) it decides whether to commit or abort a finished transaction based on updates and other data about transactions (this task is called certification), and (2) it applies the updates to the data store in case of successful certification; otherwise the transaction is aborted. That command is executed, and the state machine response is transformed into a reply to the application, which is sent to it by the server that executed the transaction. The state machine commands are executed sequentially, as in the original state machine approach. Since all servers perform the same sequence of state machine commands, they all maintain consistent versions of the state machine state (which is kept in the local data stores). However, at any time, some servers may have earlier versions than others because a state machine command is not always executed at the same time by all servers.
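A commit command of this kind might be sketched as follows. The certification rule used here (abort if any item read by the transaction was overwritten after the transaction's snapshot) is a simplifying assumption chosen for illustration, as are the names `ReplicaStore`, `certify_and_apply`, `snapshot` and `read_set`; the source only states that certification uses the updates and other data about transactions.

```python
class ReplicaStore:
    """Hypothetical local data store driven by sequential commit commands."""

    def __init__(self):
        self.data = {}
        self.versions = {}  # key -> number of the command that last wrote it
        self.applied = 0    # number of commit commands executed so far

    def certify_and_apply(self, snapshot, read_set, updates):
        """One state machine command: certify the transaction, then apply or abort.

        `snapshot` is the value of `applied` observed when the transaction started.
        """
        self.applied += 1
        # Task 1 (certification): a read item written after the snapshot is a conflict.
        conflict = any(self.versions.get(key, 0) > snapshot for key in read_set)
        if conflict:
            return "abort"
        # Task 2: apply the updates and record which command wrote them.
        for key, value in updates.items():
            self.data[key] = value
            self.versions[key] = self.applied
        return "commit"
```

Because the command is deterministic, every server that executes the same sequence of commit commands reaches the same decisions and the same store contents. Note that `read_set` and `updates` have to reach every server even when the outcome is "abort".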
A distributed data store utilizing the database state machine approach allows for strong consistency and non-blocking concurrency, but it has drawbacks. Firstly, the network communication is not optimal, since the updates and other data of every transaction (which can be large) must be communicated to all servers irrespective of whether the transaction will eventually commit or abort. This is because these data are required by the first task of the transaction certification procedure performed by the state machine on every server. Secondly, solutions based on selecting one dedicated process to carry out this task (and thus eliminating redundant certification on other servers) resemble the 2PC or 3PC protocols, and therefore share their drawbacks.
Therefore, there is a need to develop a system and a method for implementing fault-tolerant distributed data stores and distributed computing systems utilizing a similar model of computation that will be free from the above drawbacks. The key idea of such a system and method can be explained using a two-tier state machine, which extends the notion of a general state machine in the following way.
A two-tier state machine is a state machine equipped with a set F of functions that are intended to be called only by one process (device), which is considered by the other processes (devices) to be a leader. Functions return commands intended for the state machine. Functions can be nondeterministic (may return different results each time they are called) and can be executed concurrently. Functions may transform a leader state LS that is associated with a leader process that executes the functions, where LS is separate from a machine state MS of the state machine. Given two functions f and g, the execution of g logically depends on the execution of f (or, g depends on f, for brevity) if the state transformed by g depends on the state transformed by f, with no other function intervening in between and accessing the state of f or g. Given two commands d1 and d2, d2 depends on d1, or in other words d1 precedes d2, denoted d1=>d2, if they were returned by, respectively, functions f and g such that g depends on f. A null command is an abstract command that has no precedent command. A sequence of commands is dependent if the first element of the sequence is the null command and, for any two commands d1 and d2 such that d1 is directly followed by d2 in this sequence, d1 precedes d2. The commands that have been issued for execution by the two-tier state machine can be executed concurrently with functions, and the following two conditions hold: (1) all the issued commands form a dependent sequence of commands; (2) the state machine must execute a prefix of the dependent sequence of the issued commands with no intervening command in between.
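The definition of a dependent sequence can be expressed as a simple check. In this sketch each command records the identifier of the command that precedes it, which is an assumed representation of the relation d1=>d2; the names `Command`, `NULL` and `is_dependent` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    ident: str
    precedent: Optional[str]  # ident of the preceding command; None for the null command


# The null command has no precedent command.
NULL = Command("null", None)


def is_dependent(sequence):
    """True if the sequence starts with the null command and each command
    is preceded (in the sense d1 => d2) by the one directly before it."""
    if not sequence or sequence[0] != NULL:
        return False
    return all(
        nxt.precedent == prev.ident
        for prev, nxt in zip(sequence, sequence[1:])
    )
```

For example, with d1 depending on the null command and d2 depending on d1, the sequence null, d1, d2 is dependent, while null, d2, d1 is not, and neither is any sequence that does not start with the null command.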
A distributed two-tier state machine can be implemented trivially as an ordinary distributed state machine, by having each function executed by the state machine, and requiring that the result of function execution (a command) is executed by the state machine before any other function can be issued for execution by the state machine. However, this offers no advantage over an ordinary state machine and requires functions to be deterministic. On the other hand, any naive implementation utilizing a general state machine algorithm to issue commands, in which functions are executed externally by some dedicated process (device) and the order of issued commands is not constrained by the functions returning the commands, will be incorrect. This is because the general state machine algorithms (such as Paxos and its variants, e.g. described in the patent publications U.S. Pat. Nos. 5,261,085, 7,565,433, 7,856,502, 7,558,883, and EP1659500) are not able to ensure that the sequence of issued commands is dependent. Moreover, as the concurrent execution of commands and functions is not constrained, the execution of functions can interleave with the execution of the sequence of issued commands, thus leading to inconsistencies among local data stores.
Thus there is a need to develop a novel communication protocol that can be used to ensure a fault-tolerant distributed (replicated) two-tier state machine.