This invention relates to fault-tolerant computing using replicated services.
The growing reliance of industry and government on online information services makes the consequence of failures of these services more serious. Furthermore, malicious attacks on these services have become increasingly attractive to some. One approach to design of fault-tolerant systems that are resistant to faults and malicious attacks is called xe2x80x9creplication.xe2x80x9d In replication, services are redundantly implemented, or replicated, at a number of nodes, such as on different computers on a computer network. The replicated system as a whole continues to respond correctly to client requests even when some of the replicas are faulty or have been compromised by an attack. In some approaches to replication, the replicated nodes, or xe2x80x9creplicas,xe2x80x9d operate asynchronously, while in others, the replicas operate in lock-step. Byzantine-fault-tolerant replication addresses not only faults at replicated nodes which result in the nodes not responding to requests (xe2x80x9cfail-stopxe2x80x9d errors), but also addresses the situation in which a node appears to be operating correctly but in fact is not providing correct responses. A node may be providing incorrect responses due to errors in implementation of the node (i.e., xe2x80x9cbugsxe2x80x9d) or may be operating incorrectly as a result of an attack by a malicious outside party. Attackers may compromise the correct operation of a node, and may also disrupt communication between nodes, overload nodes in xe2x80x9cdenial of servicexe2x80x9d attacks, or send messages to nodes attempting to impersonate other correctly operating nodes.
Prior asynchronous replication-based algorithms have been proposed which guarantee integrity for the service provided that greater than ⅔ of the replicas remain fault-free during the lifetime of the service.
Some prior systems actively attempt to identify which nodes are faulty and remove them from service. With fewer nodes remaining, the system may be less tolerant of further faults. One mode of attack on such a system is to attempt to have the system remove nodes that are not in fact faulty from service, thereby making it easier to compromise the remaining nodes.
A number of prior systems have been tailored to services that essentially provide xe2x80x9cwrite,xe2x80x9d xe2x80x9cread,xe2x80x9d and xe2x80x9clockxe2x80x9d services for a data store. A client uses these primitives to implement more complex operations on the data store.
In order to ensure authenticity of messages passed between replicated nodes, some replicated systems use public key cryptography to sign messages so that any recipient that has a trusted copy of the public key for a sender can authenticate a message that was received from the sender, possibly via another node. Signing messages using public key techniques can be computationally expensive.
In a general aspect, the invention provides a new approach for asynchronous state-machine replication in a fault-tolerant system. The approach offers both integrity and high availability in the presence of Byzantine faults. The approach also improves the security of previous systems by recovering replicas proactively without necessarily identifying that they have failed or been attacked. This proactive recovery limits the time extent of a particular fault by regularly recovering replicas. In this way, the system works correctly even when all the replicas fail multiple times over the lifetime of the system, provided that less than ⅓ of the replicas are all faulty within a window of vulnerability. The approach also features an efficient implementation of message authentication that prevents an attacker from impersonating a replicated node that was faulty after that node recovers.
In one aspect, in general, the invention is a method for fault tolerant operation of a distributed server system that includes N asynchronous servers that may experience faults. The method includes receiving a series of requests from a client over a time interval associated with the requests. At each of the N servers, some or all of the client requests are processed. For each of the requests processed at a server, a state of a state machine at that server is updated according to the request and a response is transmitted to the client. The method also includes resetting each of the N servers repeatedly during the time interval. Resetting a server includes establishing the state of the state machine at that server using data stored at other of the servers so that the state at that server corresponds to a common state of the server system. When (a) for a predetermined duration time window, fewer than N/3 of the server systems experience faults in any time window of the time interval of the requests of that predetermined duration, and (b) N/3 or more of the N servers experience faults at some time during the time interval of the requests, the N servers provide responses to the client that are sufficient for the client to determine correct responses to each of the series of requests.
The invention can include one or more of the following features.
The faults experienced by the N servers include Byzantine faults.
The faults experienced by the N servers include faults resulting from denial-of-service attacks in which communication between the servers is interrupted.
The method further includes, during the time interval of the requests, identifying a series of master servers from the N servers such that different servers are identified as master servers at different times. For each of the requests from the client, the method includes (a) receiving the request at a master server, (b) establishing a common sequence number for the request among greater than ⅔ of the N servers, and (c) processing the request at servers at which the common sequence number has been established. When ⅓ or fewer of the N servers are faulty, this results in greater than ⅓ of the N servers being both not faulty and transmitting a response to the client.
Establishing the state of the state machine at a server that has been reset using data stored at other of the servers includes partitioning the state into separate parts. The values of the state for the separate parts are retained from prior to resetting the server. For each separate part at that the server, a digest characterizing the retained value of the state in that part is computed. A sufficient number of digests of that part of the state at other of the N servers are received from those other servers to determine whether the digest matches the common value of that part of the state. If for any part of the state the digest computed at the server does not match the digest of the common value of that part of the state, the values of at least some of that part of the state are transferred from another of the N servers.
Establishing the state of the state machine at a server that has been reset using data stored at other of the servers further includes partitioning the state into a hierarchy of parts, such that parts of the state are partitioned into sub-parts. If the digest for any part of the state that is computed at the server does not match the digest of the common value of that part of the state, a digest characterizing each of the sub-parts of that part is computed. A sufficient number of digests of those sub-parts of the state at other of the N servers are received from those other servers to determine whether the digests match the common values of those sub-parts of the state.
Processing at least some of the requests include processing a complex operation involving multiple updates to the state machine according to each of those requests.
The method further includes, at each of the N servers, computing symmetric keys for communicating with each of the other of the N servers, and distributing the symmetric keys to the other servers. The steps of computing and distributing the keys are repeated during the time interval.
Distributing the symmetric keys to the other servers includes encrypting the keys in a message using public key cryptography.
In another aspect, in general, the invention is a method for fault tolerant operation of a distributed computer system. The distributed computer system includes one or more clients and 3F+1 server nodes, which in normal operation operate asynchronously and implement a common state machine and during faulty operation F or fewer of the server nodes are concurrently faulty. The method includes receiving a request from a client at a designated master node of the 3F+1 server nodes. A common sequence number for the request is established among at least 2F+1 of the 3F+1 server nodes using a three-phase message exchange. During the first phase a first message is sent from the designated master node to other of the server nodes identifying the received request. During a second phase a second message is sent from each non-faulty server node that received the first message to all other of the server nodes. During a third phase, a third message is sent from each of the non-faulty server nodes that received the second message to all other of the server nodes. At each of F+1 or more of the 3F+1 server nodes that are not faulty and that have received the third message, the request is processed and a result is transmitted to the client.
The invention can include one or more of the following features.
The steps of receiving a request from a client, establishing a common sequence number for the request, and processing and transmitting the result are repeated for multiple subsequent requests.
Establishing the common sequence number for a subsequent request occurs prior to completion of the processing of a request received prior to that request.
The invention includes one or more of the following advantages:
Proactive recover of replicated nodes has an advantage of preventing an attacker from slowly compromising the overall service by corrupting ⅓ or more of the replicated nodes without being detected. By regularly recovering each node, even if it does not appear to have been corrupted, an attacker must take over ⅓ of the nodes within a finite time window. The shorter the window, the more difficult it is for the attacker to succeed. This window of vulnerability can be made very small under normal load conditions (e.g., a few minutes).
The approach also provides a mechanism for tolerating faults caused by software bugs in the implementations of the replicas. This mechanism is further enhanced if multiple implementations are used for various of the replicas, in which case a particular software bug is less likely to be present in all the implementations.
By implementing services that implement complex operations, the system does not depend on correct operation of the clients. For example, if only primitive services were provided and the clients were relied upon to perform correct sequences of those primitive operations to effect complex operations, the system could be vulnerable to corruption if a client were faulty or subject to an attack. In a particular example, a file system service providing complex operations which guarantee that the meta-data (e.g., inodes) of the file system satisfy particular invariants may be less vulnerable to corruption by a client than would be a system in which the client is responsible for maintaining the consistency of the meta-data.
The approach also has the advantage of providing detection of denial-of-service attacks aimed at increasing the window of vulnerability. Replicas can time how long a recovery takes and alert their administrator if it exceeds some pre-established bound.
The approach also provides an efficient mechanism for detecting when the state at a replica has been corrupted, and for recovering the state from other of the replicas without necessarily requiring that the entire state be transferred to the corrupted replica.
Proactive recovery of replicas also prevents the service from being compromised by an attack that escapes detection unless the attack compromises more than ⅓ of the replicas within a small time window.
Using symmetric cryptography for messages between the replicas can be more efficient than using public key cryptography for all messages. The approach provides an efficient mechanism for authenticating a message that has passed through a possibly corrupted replica without requiring that the message be encrypted using the public key of the originator of the message.
Each replica can operate with a bounded storage and performs garbage collection to enforce the bound. If necessary, for example in the presence of a Byzantine fault, deleted or corrupted state can be recovered from other of the replicas.