As personal computing devices become more powerful, containing increased storage space and processing capabilities, the average user consumes an increasingly smaller percentage of those resources in performing everyday tasks. Thus, many of today's personal computing devices are often not used to their full potential because their computing abilities greatly exceed the demands most users place upon them. An increasingly popular method of deriving use and value from the unused resources of powerful modern personal computing devices is a distributed computing system, in which the computing devices act in coordination with one another to provide more reliable access to data and computational resources.
In addition to providing a useful mechanism for using excess computing capacity, distributed systems can also be composed of dedicated inexpensive computing devices in order to achieve the performance and storage capabilities of a larger, more-expensive computing device. A further advantage of distributed systems is the ability to continue to operate in the face of physical difficulties that would cripple a single, larger computing device. Such difficulties could include: sustained power outages, inclement weather, flooding, terrorist activity, and the like.
To compensate for the increased risk that individual member computing devices may become disconnected from the network, turned off, suffer a system malfunction, or otherwise become unusable, redundancy can be used to allow the distributed computing system to remain operational. Thus, the information stored on any one personal computing device can be redundantly stored on at least one additional personal computing device, allowing the information to remain accessible, even if one of the personal computing devices fails.
A distributed computing system can practice complete redundancy, in which every device within the system performs identical tasks and stores identical information. Such a system can allow users to continue to perform useful operations even if all but one of the devices should fail. Alternatively, such a system can be used to allow multiple copies of the same information to be distributed throughout a geographic region. For example, a multi-national corporation can establish a world-wide distributed computing system.
However, distributed computing systems can be difficult to maintain due to the complexity of properly synchronizing the individual devices that make up the system. Because time-keeping across individual processes can be difficult at best, a state machine approach is often used to coordinate activity among the individual devices. A state machine can be described by a set of states, a set of commands, a set of responses, and a function that maps each command/state pair to a response/state pair. A state machine can execute a command by changing its state and producing a response. Thus, a state machine can be completely described by its current state and the action it is about to perform, removing the need to use precise time-keeping.
The current state of a state machine is, therefore, dependent upon its previous state, the commands performed since then, and the order in which those commands were performed. To maintain synchronization between two or more state machines, a common initial state can be established, and each state machine can, beginning with the initial state, execute the identical commands in the identical order. Therefore, to synchronize one state machine to another, a determination of the commands performed by the other state machine needs to be made. The problem of synchronization, therefore, becomes a problem of determining the order of the commands performed, or, more specifically, determining the particular command performed for a given step.
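The dependence on command order described above can be illustrated with a minimal sketch. The `CounterMachine` class, its `set` and `add` commands, and the `replay` helper are hypothetical names chosen only for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class CounterMachine:
    """Hypothetical deterministic state machine whose state is a dict of counters."""
    state: dict = field(default_factory=dict)

    def execute(self, command):
        """Execute one command by changing the state and producing a response."""
        op, key, amount = command
        if op == "add":
            self.state[key] = self.state.get(key, 0) + amount
        elif op == "set":
            self.state[key] = amount
        else:
            raise ValueError(f"unknown command: {op}")
        return self.state[key]


def replay(commands):
    """Start from the common initial (empty) state and execute commands in order."""
    machine = CounterMachine()
    responses = [machine.execute(c) for c in commands]
    return machine.state, responses


log = [("set", "x", 5), ("add", "x", 2), ("set", "x", 1)]
state_a, _ = replay(log)                   # first replica
state_b, _ = replay(log)                   # second replica, same commands, same order
state_c, _ = replay(list(reversed(log)))   # same commands, different order
```

Here `state_a` and `state_b` are identical because both replicas executed the identical commands in the identical order, while `state_c` diverges even though it executed the same set of commands.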
One mechanism for determining which command is to be performed for a given step is known as the Paxos algorithm. In the Paxos algorithm, any of the individual devices can act as a leader and seek to propose a given client command for execution by every device in the system. Every such proposal can be sent with a proposal number to more easily track the proposals. Such proposal numbers need not bear any relation to the particular step for which the devices are attempting to agree upon a command to perform. Initially, the leader can suggest a proposal number for a proposal the leader intends to submit. Each of the remaining devices can then respond to the leader's suggestion of a proposal number with an indication of the last proposal they voted for, or an indication that they have not voted for any proposals. If, through the various responses, the leader does not learn of any other proposals that were voted for by the devices, the leader can propose that a given client command be executed by the devices, using the proposal number suggested in the earlier message. Each device can, at that stage, determine whether to vote for the action or reject it. A device should only reject an action if it has responded to another leader's suggestion of a higher proposal number. If a sufficient number of devices, known as a quorum, vote for the proposal, the proposed action is said to have been agreed upon, and each device performs the action and can transmit the results. In such a manner, each of the devices can perform actions in the same order, maintaining the same state among all of the devices.
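The exchange described above, for a single step, can be sketched as follows. This is a simplified, in-memory illustration rather than a complete implementation: the `Acceptor` class and `run_leader` function are hypothetical names, messages are modeled as direct method calls, the quorum is assumed to be a simple majority, and failures and message loss are not modeled.

```python
class Acceptor:
    """Hypothetical device that responds to proposal numbers and votes."""

    def __init__(self):
        self.promised = -1   # highest proposal number this device has responded to
        self.voted = None    # (proposal number, command) of the last vote, if any

    def prepare(self, number):
        """First phase: answer a suggested proposal number with the last vote cast."""
        if number > self.promised:
            self.promised = number
            return True, self.voted
        return False, None

    def accept(self, number, command):
        """Second phase: vote unless a higher proposal number was already answered."""
        if number >= self.promised:
            self.promised = number
            self.voted = (number, command)
            return True
        return False


def run_leader(acceptors, number, command):
    """Return the command agreed upon, or None if no quorum was reached."""
    quorum = len(acceptors) // 2 + 1   # assumption: a simple majority
    replies = [a.prepare(number) for a in acceptors]
    promises = [vote for ok, vote in replies if ok]
    if len(promises) < quorum:
        return None
    # If any device already voted, re-propose the highest-numbered prior command.
    prior = [vote for vote in promises if vote is not None]
    if prior:
        command = max(prior, key=lambda v: v[0])[1]
    votes = sum(a.accept(number, command) for a in acceptors)
    return command if votes >= quorum else None


devices = [Acceptor() for _ in range(3)]
first = run_leader(devices, 1, "write A")    # no prior votes, so "write A" is proposed
second = run_leader(devices, 2, "write B")   # learns of the earlier vote for "write A"
stale = run_leader(devices, 1, "write C")    # number 1 is now too low to be answered
```

In this sketch the second leader ends up re-proposing "write A" rather than its own command, because the responses to its proposal number reveal that the devices already voted for it.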
Generally, the Paxos algorithm can be thought of in two phases, with an initial phase that allows a leader to learn of prior proposals that were voted on by the devices, as described above, and a second phase in which the leader can propose client commands for execution. Once the leader has learned of prior proposals, it need not continually repeat the first phase. Instead, the leader can continually repeat the second phase, proposing a series of client commands that can be executed by the distributed computing system in multiple steps. In such a manner, while each client command performed by the distributed computing system for each step can be thought of as one instance of the Paxos algorithm, the leader need not wait for the devices to vote on a proposed client command for a given step before proposing another client command for the next step.
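The idea of performing the first phase once and then repeating only the second phase for a series of steps can be sketched as follows. The `Device` class, the `lead` function, and the `prepare_calls` counter are hypothetical and exist only for illustration; the sketch assumes a majority quorum and reliable, synchronous calls, and it does not model the leader proposing a later step before votes for an earlier step arrive.

```python
class Device:
    """Hypothetical device that tracks its votes separately for each step."""

    def __init__(self):
        self.promised = -1       # highest proposal number answered so far
        self.votes = {}          # step -> (proposal number, command)
        self.prepare_calls = 0   # counts first-phase messages, for illustration

    def prepare(self, number):
        self.prepare_calls += 1
        if number > self.promised:
            self.promised = number
            return True, dict(self.votes)
        return False, {}

    def accept(self, step, number, command):
        if number >= self.promised:
            self.promised = number
            self.votes[step] = (number, command)
            return True
        return False


def lead(devices, number, new_commands):
    """Run the first phase once, then the second phase once per step."""
    quorum = len(devices) // 2 + 1   # assumption: a simple majority
    replies = [d.prepare(number) for d in devices]
    granted = [votes for ok, votes in replies if ok]
    if len(granted) < quorum:
        return {}
    # For each step, keep the highest-numbered command any device voted for.
    prior = {}
    for votes in granted:
        for step, (n, command) in votes.items():
            if step not in prior or n > prior[step][0]:
                prior[step] = (n, command)
    plan = {step: command for step, (n, command) in prior.items()}
    # Assign the new client commands to steps with no prior votes.
    step = 0
    for command in new_commands:
        while step in plan:
            step += 1
        plan[step] = command
    # Second phase only, repeated for every step.
    chosen = {}
    for step in sorted(plan):
        votes = sum(d.accept(step, number, plan[step]) for d in devices)
        if votes >= quorum:
            chosen[step] = plan[step]
    return chosen


devices = [Device() for _ in range(3)]
chosen = lead(devices, 1, ["a", "b", "c"])
```

Three commands are agreed upon for three steps, yet each device answers only one first-phase message.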
The distributed computing system, as a whole, can be modeled as a state machine. Thus, a distributed computing system implementing complete redundancy can have each of the devices replicate the state of the overall system. Such a system requires that each device maintain the same state. If some devices believe that one client command was executed, while a second group of devices believes that a different client command was executed, the overall system no longer operates as a single state machine. To avoid such a situation, a majority of the devices can be generally required to select a proposed client command for execution by the system. Because any two groups of devices, each having a majority, must share at least one device, mechanisms, such as the Paxos algorithm, can be implemented that rely on the at least one common device to prevent two groups, each containing a majority of devices, from selecting different proposed client commands.
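The property that any two majorities share at least one device can be checked exhaustively for a small system. The following sketch assumes a hypothetical five-device system with devices named "A" through "E"; the `majorities` helper is likewise illustrative.

```python
from itertools import combinations


def majorities(devices):
    """Enumerate every group containing more than half of the devices."""
    need = len(devices) // 2 + 1
    return [set(group)
            for size in range(need, len(devices) + 1)
            for group in combinations(devices, size)]


devices = ["A", "B", "C", "D", "E"]
groups = majorities(devices)

# Every pair of majority groups has a non-empty intersection.
all_overlap = all(g1 & g2 for g1 in groups for g2 in groups)

# By contrast, two groups that are not majorities can be disjoint.
minority_overlap = {"A", "B"} & {"C", "D"}
```

Because `all_overlap` holds, at least one device always belongs to both groups and can prevent them from selecting different client commands; the empty `minority_overlap` shows why groups smaller than a majority do not offer the same protection.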
However, the Paxos algorithm adds message delays between when a client sends a request for the distributed system to execute a command, and when the client receives the results from the execution of that command. Specifically, even if the client transmits a request to a leader, and even if the leader has already learned of previously voted on proposals, and thus has completed the first phase of the Paxos algorithm, there can still be two or more message delays between the transmission of the request from the client, and the transmission of the results to the client. Furthermore, the Paxos algorithm can require the presence of a leader device that receives client requests and determines the appropriate functions to submit for a vote to the devices of the distributed computing system. Should such a leader device fail, a new leader may not take its place immediately, leaving the distributed computing system idle and the client waiting for a response to its requests.
One mechanism for implementing a distributed fault tolerant algorithm having fewer message delays is a Fast Paxos algorithm in which the first phase of the standard Paxos algorithm is performed by a leader and the second phase is performed directly by clients of the distributed system. Thus, a leader device can learn of previously voted on proposals, and can ensure that devices in the distributed computing system have agreed on a common state. Once the leader learns of no further pending proposals, it can signal to the other devices that they should treat messages received directly from the clients of the system as proposals using the proposal number the leader learned of while performing the first phase. A client can then send proposals directly to the devices which, unless they have previously voted for a proposal, can vote for the client's proposal. Because there is no leader device to collect votes, the devices can execute the proposed function instead of voting for it. Once the client receives responses from a sufficient number of devices, it can determine that the system has executed the function it proposed. In such a manner the client can receive a response without any intervening message delays between the transmission of the client's proposal, and the devices' responses.
However, the Fast Paxos algorithm cannot tolerate a conflict between two or more clients. Specifically, if two or more clients propose different functions at approximately the same time, the devices may be unable to choose between the different functions. In such a case, the system must stop using the Fast Paxos algorithm and return to the regular Paxos algorithm, with the leader beginning with the first phase, in an effort to resolve the discrepancy among the devices in the system. The two or more clients that submitted the conflicting proposals may then experience an even greater delay in receiving their responses than if the system had never attempted to operate using the Fast Paxos algorithm.
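The conflict described above can be illustrated with a small sketch in which each device votes for the first client proposal it receives. The `FastDevice` class and `fast_round` function are hypothetical, and the four-device system with a quorum of three is an assumption made only for this example; the text above does not specify a quorum size.

```python
from collections import Counter


class FastDevice:
    """Hypothetical device that votes for the first client proposal it receives."""

    def __init__(self):
        self.vote = None

    def receive(self, command):
        if self.vote is None:   # a device that has already voted keeps its vote
            self.vote = command
        return self.vote


def fast_round(arrival_orders, quorum):
    """Deliver proposals to each device in its own arrival order.

    Returns the command that gathered a quorum of votes, or None when the
    devices split their votes and the leader has to resolve the conflict.
    """
    devices = [FastDevice() for _ in arrival_orders]
    for device, order in zip(devices, arrival_orders):
        for command in order:
            device.receive(command)
    tally = Counter(d.vote for d in devices)
    command, count = tally.most_common(1)[0]
    return command if count >= quorum else None


# One client: every device receives the same single proposal.
no_conflict = fast_round([["x"], ["x"], ["x"], ["x"]], quorum=3)

# Two clients at about the same time: devices see the proposals in different orders.
conflict = fast_round([["x", "y"], ["x", "y"], ["y", "x"], ["y", "x"]], quorum=3)
```

In the second call the votes split two against two, so neither proposal reaches the assumed quorum and `fast_round` returns `None`, which corresponds to the situation in which the system must fall back to the regular Paxos algorithm.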