As personal computing devices become more powerful, containing increased storage space and processing capabilities, the average user consumes an increasingly smaller percentage of those resources in performing everyday tasks. Thus, many of today's personal computing devices are often not used to their full potential because their computing abilities greatly exceed the demands most users place upon them. An increasingly popular method of deriving use and value from the unused resources of powerful modern personal computing devices is a distributed computing system, in which the computing devices act in coordination with one another to perform tasks and maintain data.
A distributed computing system can utilize a number of interconnected computing devices to achieve the performance and storage capabilities of a larger, more expensive computing device. Thus, while each personal computing device may have only a few gigabytes of usable storage space, a distributed computing system comprising a number of such devices can aggregate the available storage space on each individual device and present to a user a terabyte or more of usable storage space. Similarly, a distributed computing system can present to a user a large amount of usable processing power by dividing the user's tasks into smaller segments and transmitting the segments to the individual devices for processing in parallel.
To effectively derive value from the unused capabilities of modern personal computing devices, a distributed computing system should not interfere with the individual use of each personal computing device. Because individual users retain control of the devices, however, the reliability of each device, from the perspective of the distributed computing system, is greatly decreased. To compensate for the increased risk that an individual computing device may become disconnected from the network, be turned off, suffer a system malfunction, or otherwise become unusable to the distributed computing system, redundancy can be used to allow the distributed computing system to remain operational. Thus, the information stored on any one personal computing device can be redundantly stored on at least one additional similar personal computing device, allowing the information to remain accessible, even if one of the personal computing devices fails.
Alternatively, a distributed computing system can practice complete redundancy, in which every device within the system performs identical tasks and stores identical information. Such a system can allow users to continue to perform useful operations even if all but one of the devices should fail. Alternatively, such a system can be used to allow multiple copies of the same information to be distributed throughout a geographic region. For example, a multi-national corporation can establish a world-wide distributed computing system. Such a corporation might use a number of high performance server computing devices, rather than less powerful personal computing devices, because each individual computing device would be required to service many users within that geographic region. The individual high performance devices can each perform identical tasks and store identical data, allowing users who merely seek to access the data to obtain such access from a high performance device located in a convenient location for that user.
However, distributed computing systems can be difficult to maintain due to the complexity of properly synchronizing the individual devices that comprise the system. Because time-keeping across individual processes can be difficult at best, a state machine approach is often used to coordinate activity among the individual devices. A state machine can be described by a set of states, a set of commands, a set of responses, and functions that link each response/state pair to each command/state pair. A state machine can execute a command by changing its state and producing a response. Thus, a state machine can be completely described by its current state and the action it is about to perform, removing the need to use precise time-keeping.
The current state of a state machine is, therefore, dependent upon its previous state, the commands performed since then, and the order in which those commands were performed. To maintain synchronization between two or more state machines, a common initial state can be established, and each state machine can, beginning with the initial state, execute the identical commands in the identical order. Therefore, to synchronize one state machine to another, a determination of the commands performed by the other state machine needs to be made. The problem of synchronization, therefore, becomes a problem of determining the order of the commands performed, or, more specifically, determining the particular command performed for a given step.
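The synchronization argument above can be illustrated with a minimal sketch. The `CounterMachine` class, its two commands, and the `replay` helper are hypothetical names chosen only for illustration; the sketch assumes a deterministic machine whose state is a single integer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CounterMachine:
    """A toy deterministic state machine whose state is one integer."""
    state: int = 0

    def execute(self, command: Tuple[str, int]) -> int:
        """Apply one command, change the state, and return the response."""
        op, arg = command
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg
        else:
            raise ValueError(f"unknown command: {op}")
        return self.state


def replay(commands: List[Tuple[str, int]]):
    """Start from the common initial state and execute the commands in order."""
    machine = CounterMachine()
    responses = [machine.execute(c) for c in commands]
    return machine.state, responses


log = [("add", 2), ("mul", 5), ("add", 1)]
replica_a = replay(log)
replica_b = replay(log)
# Same commands, different order: the final state diverges.
reordered = replay([("mul", 5), ("add", 2), ("add", 1)])
```

Two replicas that replay the same log end in the same state, while the reordered log ends in a different one, which is why agreement on the command for each step is sufficient for synchronization.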
One mechanism for determining which command is to be performed for a given step is known as the Paxos algorithm. In the Paxos algorithm, any of the individual devices can act as a leader and seek to propose that a given function be executed by every device in the system as the command to be performed for a given step. Every such proposal can be sent with a proposal number to more easily track the proposals. Such proposal numbers need not bear any relation to the particular step for which the devices are attempting to agree upon a command to perform. Initially, the leader can suggest a proposal number for a proposal the leader intends to submit. Each of the remaining devices can then respond to the leader's suggestion of a proposal number with an indication of the last proposal they voted for, or an indication that they have not voted for any proposals. If, through the various responses, the leader does not learn of any other proposals that were voted for by the devices, the leader can propose that a given function be executed by the devices, using the proposal number suggested in the earlier message. Each device can, at that stage, determine whether to vote for the action or reject it. A device should only reject an action if it has responded to another leader's suggestion of a different proposal number. If a sufficient number of devices, known as a quorum, vote for the proposal, the proposed action is said to have been agreed upon, and each device performs the action and transmits the results. In such a manner, an agreed upon command can be determined to be performed for a given step, maintaining the same state among all of the devices.
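The exchange described above might be sketched, for a single step, roughly as follows. This is a simplified illustration rather than a complete implementation: it runs in one process without a network, it assumes a quorum is a simple majority, and it assumes a device rejects only proposals numbered lower than one it has already responded to. The names `Device`, `on_suggest`, `on_propose`, and `run_round` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Device:
    promised: int = -1                       # highest proposal number responded to
    voted: Optional[Tuple[int, str]] = None  # (number, command) last voted for

    def on_suggest(self, number: int):
        """Phase one: reply to a suggested proposal number with the last vote."""
        if number > self.promised:
            self.promised = number
            return ("last_vote", self.voted)
        return None  # already responded to an equal or higher number

    def on_propose(self, number: int, command: str) -> bool:
        """Phase two: vote unless a higher number has been responded to."""
        if number < self.promised:
            return False
        self.promised = number
        self.voted = (number, command)
        return True


def run_round(devices: List[Device], number: int, command: str) -> Optional[str]:
    """Return the agreed command, or None if no quorum was reached."""
    quorum = len(devices) // 2 + 1  # assumption: simple majority
    replies = [r for d in devices if (r := d.on_suggest(number)) is not None]
    if len(replies) < quorum:
        return None
    prior = [vote for _, vote in replies if vote is not None]
    if prior:
        # The leader learned of earlier votes, so it must re-propose the
        # command of the highest-numbered one instead of its own.
        command = max(prior)[1]
    votes = sum(d.on_propose(number, command) for d in devices)
    return command if votes >= quorum else None


devices = [Device() for _ in range(3)]
first = run_round(devices, 1, "write x")   # no prior votes: "write x" is chosen
second = run_round(devices, 2, "write y")  # prior vote found: "write x" again
stale = run_round(devices, 1, "write z")   # stale proposal number: no quorum
```

Once a command has been chosen, a later leader with a higher proposal number learns of it in the first phase and re-proposes it, so the devices never agree on two different commands for the same step.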
Generally, the Paxos algorithm can be thought of in two phases, with an initial phase that allows a leader to learn of prior proposals that were voted on by the devices, as described above, and a second phase in which the leader can propose functions for execution. Once the leader has learned of prior proposals, it need not continually repeat the first phase. Instead, the leader can continually repeat the second phase, proposing a series of functions that can be executed by the distributed computing system in multiple steps. In such a manner, while each function performed by the distributed computing system for each step can be thought of as one instance of the Paxos algorithm, the leader need not wait for the devices to vote on a proposed function for a given step before proposing another function for the next step.
The Paxos algorithm, described above, assumes that a faulty device will simply cease communication and will not act upon any data. However, a device experiencing a “Byzantine” fault exhibits malicious behavior that is unpredictable and may appear to be functioning properly. The Paxos algorithm can be changed to operate properly even in the face of such malicious devices. Each message sent by a device can contain a proof of the message's authenticity, such as through the use of message authenticators, and can contain a proof that the information contained in the message is proper in light of the requirements of the Paxos algorithm. The requisite proof of propriety can be provided by adding two additional steps to the algorithm described generally above.
Byzantine faults can occur in two general varieties. Either a malicious device can spoof a message, such as by intercepting and changing a message between two properly functioning devices, or the malicious device can transmit false messages. Thus, to avoid acting on messages from malicious devices, a properly functioning device receiving a message can seek to verify both that the message is unchanged and that the message is proper. Tampering with or editing of a message in transit can be detected through the use of message authenticators. Because messages between two devices may need to be forwarded on to other devices, the sending device can include authenticators of the message directed to both the initial destination device and the forwarded destination device. The authenticator of the message that is directed to the initial destination device can authenticate both the message itself and the authenticator of the message that is directed to the forwarded destination device.
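The nested authenticators described above might look roughly like the following sketch. It assumes, purely for illustration, that an authenticator is an HMAC-SHA256 tag computed with a secret key shared between the sender and each destination; the text does not specify the construction, and the function names are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, data: bytes) -> bytes:
    """Compute an authenticator (assumed here to be HMAC-SHA256)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def send(message: bytes, key_first: bytes, key_forward: bytes):
    """Build (message, auth_forward, auth_first) for a forwardable message."""
    auth_forward = mac(key_forward, message)
    # The first-hop authenticator covers both the message and the
    # authenticator intended for the forwarded destination.
    auth_first = mac(key_first, message + auth_forward)
    return message, auth_forward, auth_first


def verify_first_hop(packet, key_first: bytes) -> bool:
    """Check performed by the initial destination device."""
    message, auth_forward, auth_first = packet
    expected = mac(key_first, message + auth_forward)
    return hmac.compare_digest(auth_first, expected)


def verify_forwarded(message: bytes, auth_forward: bytes, key_forward: bytes) -> bool:
    """Check performed by the device that receives the forwarded message."""
    return hmac.compare_digest(auth_forward, mac(key_forward, message))


k_first, k_forward = b"key shared with B", b"key shared with C"
packet = send(b"voted for proposal 7", k_first, k_forward)
ok_first = verify_first_hop(packet, k_first)
ok_forward = verify_forwarded(packet[0], packet[1], k_forward)
```

Because the first-hop tag covers the forwarded tag, the initial destination can detect a change to either the message or the forwarded authenticator, although it cannot itself check that the forwarded authenticator is valid without the second key.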
The propriety of a message can be proven by illustrating that a sufficient number of devices within the system have agreed to the message. If a number of devices within a distributed computing system are malicious, those devices can work together and agree upon the transmission of false messages in an effort to deceive properly functioning devices. However, if a device receives the same information from more devices than there are malicious devices, then the information must be true because, even if all of the malicious devices participated, at least one of the messages must have come from a properly functioning device, and can therefore be trusted. More broadly, defining the variable “M” to represent the number of malicious devices within a distributed computing system, any device can trust information which it has received from at least M+1 different devices. A transmitting device can prove the propriety of a message by sending, with the message, a sufficiently large collection of messages originally sent to that transmitting device that indicate the information contained in the message is true. However, a message sent by a malicious device could be properly authenticated for the transmitting device, yet may not be properly authenticated for the receiving device. Thus, the transmitting device, upon receiving M+1 messages containing the same information, may properly believe that the information is true, but if it seeks to forward those messages onto a receiving device, it is possible that only one of them will be properly authenticated for the receiving device. However, the receiving device, like any other device, requires that M+1 equivalent properly authenticated messages assert the information before it can believe that the information contained in the messages is true. Therefore, to ensure that the receiving device receives at least those M+1 messages, the transmitting device can forward a collection of messages having at least M+1+M or 2M+1 messages. 
Such a collection is sufficiently large that, even if a message from every malicious device was included, M+1 equivalent properly authenticated messages would still be received by the receiving device. Therefore, the receipt of M+1 equivalent properly authenticated messages by any device enables that device to trust the information contained in the message. Furthermore, if 2M+1 equivalent properly authenticated messages are received, the receiving device can forward those messages along to prove to the device receiving the forwarded messages that the information contained in the messages is true.
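The two thresholds can be made concrete with a short sketch. It assumes each sender contributes at most one properly authenticated report, represented as a mapping from a sender identifier to the value that sender asserted; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def values_with_support(reports: Dict[str, str], threshold: int) -> Set[str]:
    """Values asserted by at least `threshold` distinct senders."""
    counts = Counter(reports.values())
    return {value for value, n in counts.items() if n >= threshold}


def trusted(reports: Dict[str, str], m: int) -> Set[str]:
    """Values the receiver itself can believe: at least M+1 senders agree."""
    return values_with_support(reports, m + 1)


def provable(reports: Dict[str, str], m: int) -> Set[str]:
    """Values the receiver can also prove to others: at least 2M+1 agree."""
    return values_with_support(reports, 2 * m + 1)


# With M = 1 malicious device, two matching reports suffice for trust,
# but three are needed before the collection can serve as a forwarded proof.
two_agree = {"a": "x", "b": "x", "c": "y"}
three_agree = {"a": "x", "b": "x", "c": "x"}
```

With M = 1, `trusted(two_agree, 1)` contains "x" while `provable(two_agree, 1)` is empty; only the three matching reports in `three_agree` meet the 2M+1 threshold.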
A transmitting device can, therefore, prove the propriety of a message by sending, with the message, a sufficiently large collection of messages originally sent to that transmitting device that indicate the information contained in the message is true. Like the Paxos algorithm, the modified Paxos algorithm which operates properly with malicious devices can be conceptually divided into a first phase in which the leader learns of prior, “safe” proposals, and a second phase in which the leader proposes functions for execution by the distributed computing system. An additional step can be added to the first phase of the Paxos algorithm that allows each of the recipient devices to transmit, to the other devices in response to a message suggesting a proposal number from the leader, the most recent proposal for which that recipient device has voted together with a proof that the device was allowed to vote for that proposal, as will be described below. Once each device receives the messages from the other recipient devices, each can independently determine safe proposals, or proposals not submitted by malicious devices which other, also non-malicious, devices had already voted for. Such safe proposals can be determined by finding proposals for which messages were received from a sufficient number of devices indicating that those devices had voted for that proposal. The determined safe proposals can then be transmitted to the leader, together with the messages from the other devices as proof that the determined safe proposals are, in fact, safe.
Proposals can be submitted by the leader for voting, using the messages transmitting the determined safe proposals as a proof of the safety of the proposal. An additional step can then be added to the second phase of the Paxos algorithm that allows each of the devices to send a message to each other indicating that the current proposal is the only proposal with that proposal number for which the device will vote. A device will accept a proposal submitted for voting, so long as that device received such messages from a quorum of devices and so long as that device has not responded to another message, such as from another leader, suggesting a different proposal number. If a device accepts the proposal, it can send a message to the leader, as before, signaling its acceptance. Additionally, the devices can save the messages from the quorum of devices indicating that the proposal is the only proposal with that proposal number for which those devices will vote in order to provide proof of the appropriateness of casting the vote when the device sends an indication of the last proposal it voted for, as stated above.
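The acceptance rule for the added second-phase step can be sketched as below. The sketch is a simplification under stated assumptions: authenticators are omitted, the quorum size is passed in rather than derived, and the names `Voter`, `on_pledge`, `will_accept`, and `proof_of_vote` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Voter:
    quorum: int
    responded_to: int  # proposal number this device responded to in phase one
    # (proposal number, command) -> devices pledging to vote only for it
    pledges: Dict[Tuple[int, str], Set[str]] = field(default_factory=dict)

    def on_pledge(self, sender: str, number: int, command: str) -> None:
        """Record a message saying `sender` will vote only for this proposal."""
        self.pledges.setdefault((number, command), set()).add(sender)

    def will_accept(self, number: int, command: str) -> bool:
        """Accept only with a quorum of pledges and no competing number."""
        if number != self.responded_to:
            return False
        return len(self.pledges.get((number, command), set())) >= self.quorum

    def proof_of_vote(self, number: int, command: str) -> List[str]:
        """Senders whose saved pledges justify having cast the vote."""
        return sorted(self.pledges.get((number, command), set()))


voter = Voter(quorum=3, responded_to=5)
for sender in ("a", "b"):
    voter.on_pledge(sender, 5, "write x")
before = voter.will_accept(5, "write x")  # only two pledges so far
voter.on_pledge("c", 5, "write x")
after = voter.will_accept(5, "write x")   # quorum of three reached
```

Keeping the pledges after voting is what lets the device later attach them as proof that its vote was permitted.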
Upon receipt of messages from a quorum of devices accepting the proposal, the leader can transmit a message to all of the devices requesting that they execute the function contained in the proposal, together with proof that the leader is performing properly in making such a request, which comprises the quorum of messages received from the devices. The leader can also attach to the success message another proposal for which voting is solicited, increasing the efficiency of the algorithm. Additionally, as described above, once the leader has learned of all of the safe proposals for current and future steps of the system, it can continue to propose functions for future steps prior to receiving a vote from the devices on the proposed function for the current step.
However, as can be seen, the modified Paxos algorithm that can accommodate Byzantine failures can add message delays such that at least three message delays exist between the transmission of a request by a client of the distributed computing system and the receipt, by the client, of a response to the client's request. For example, once a client's request is received by a leader device, one message delay can be required to transmit the request, as a proposal, to the devices for a vote. A second message delay can be introduced when each of the devices sends, to the other devices, a message indicating that the received proposal is the only proposal with that proposal number for which that device will vote. Finally, a third message delay can be required to transmit the devices' votes to the leader device. After receiving the votes and determining the result, the leader can inform the client of the result. Depending on the type of network used, and the proximity of the devices, such message delays can cause a noticeable slowness in the overall system. As a result, it is desirable to reduce the number of message delays required between the receipt of a client's request and the response to the client.