Distributed systems running in error-prone and adversarial environments have to rely on trusted components. In today's Internet these are typically directory and authorization services, such as the domain name system (DNS), Kerberos, certification authorities, or secure directories. Building such centralized trusted services has turned out to be a valuable design principle for computer security because the trust in them can be leveraged by many diverse applications that all benefit from centralized management. Often, a trusted service is implemented as the only task of an isolated and physically protected machine.
Unfortunately, centralization introduces a single point of failure. Even worse, it is increasingly difficult to protect any single system against the sort of attacks proliferating on the Internet today. One established way of enhancing the fault tolerance of centralized components is to distribute them among a set of servers and to use replication algorithms for masking faulty servers or devices. Thus, no single server has to be trusted completely and the overall system derives its integrity from a majority of correct servers.
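The idea of masking faulty servers by relying on a majority of correct ones can be illustrated with a minimal sketch. It is not a replication algorithm from any of the works cited here; the function name `mask_faulty_replies` and the sample addresses are hypothetical, and the sketch assumes that every server answers the same request and that a strict majority of the servers is correct.

```python
from collections import Counter


def mask_faulty_replies(replies):
    """Return the value reported by a strict majority of servers, or None.

    `replies` holds one answer per server for the same request (assumption).
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    # Accept only a value that more than half of all servers reported.
    return value if 2 * count > len(replies) else None


# Five replicated servers answer the same lookup; one of them is faulty.
replies = ["10.0.0.7", "10.0.0.7", "6.6.6.6", "10.0.0.7", "10.0.0.7"]
print(mask_faulty_replies(replies))
```

A single deviating reply cannot change the result as long as the correct servers form a majority; if no value reaches a majority, the sketch returns `None` rather than guessing.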
The use of cryptographic methods for maintaining consistent state in a distributed system has a long history and originates with the work of M. Pease, R. Shostak, and L. Lamport, in “Reaching agreement in the presence of faults,” Journal of the ACM, vol. 27, pp. 228-234, April 1980.
The work of M. K. Reiter and K. P. Birman, “How to securely replicate services,” ACM Transactions on Programming Languages and Systems, vol. 16, pp. 986-1009, May 1994, introduces secure state machine replication in a Byzantine environment and a broadcast protocol based on threshold cryptography that maintains causality among the requests.
Since no robust threshold-cryptographic schemes and no secure atomic broadcast protocols were known at that time, no fully robust systems for an asynchronous environment with malicious faults could be designed.
Subsequent work by Reiter in “Distributing trust with the Rampart toolkit,” Communications of the ACM, vol. 39, pp. 71-74, April 1996 assumes a model which implements atomic broadcast on top of a group membership protocol that dynamically removes apparently faulty servers from the set.
M. Castro and B. Liskov, in “Practical Byzantine fault tolerance,” Proc. Third Symp. Operating Systems Design and Implementation, 1999, present a practical algorithm for distributed service replication that is fast if no failures occur. It requires no explicit time-out values, but assumes that message transmission delays do not grow faster than some predetermined function for an indefinite duration. Because this protocol is deterministic, a Byzantine adversary can block it, which violates liveness. An approach based on a probabilistic agreement protocol that satisfies both safety and liveness would be preferable.
The Total family of algorithms for total ordering by L. E. Moser and P. M. Melliar-Smith, “Byzantine-resistant total ordering algorithms,” Information and Computation, vol. 150, pp. 75-111, 1999, implements atomic broadcast in a Byzantine environment, but only under the assumption of a benign network scheduler with some specific probabilistic fairness guarantees. Although this may be realistic in highly connected environments with separate physical connections between all machines, it does not seem appropriate for arbitrary Internet settings.
The work of K. P. Kihlstrom, L. E. Moser, and P. M. Melliar-Smith, “The SecureRing protocols for securing group communication,” in Proc. 31st Hawaii International Conference on System Sciences, pp. 317-326, IEEE, January 1998, and the work of A. Doudou, B. Garbinato, and R. Guerraoui, “Abstractions for devising Byzantine-resilient state machine replication,” in Proc. 19th Symposium on Reliable Distributed Systems (SRDS 2000), pp. 144-152, 2000, are two examples of atomic broadcast protocols that rely on failure detectors in the Byzantine model. They encapsulate all time-dependent aspects and obvious misbehavior of a party in the abstract notion of a failure detector and permit clean, deterministic protocols. However, failure detectors are not well understood in Byzantine environments.
U.S. Pat. No. 4,644,542 describes a method that uses only an exchange of messages to reliably broadcast information in a point-to-point network of processors in the presence of component faults, provided that the network remains connected. The method possesses the properties that every message broadcast by a fault-free processor is accepted exactly once by all fault-free processors within a bounded time, that every message broadcast is either accepted by all fault-free processors or by none of them, and that all messages accepted by fault-free processors are accepted in the same order by all those processors. The method is based on a diffusion technique for broadcasting information and on special message validity tests for tolerating any number of component failures up to network partitioning or successful forgery.
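The three acceptance properties listed above can be stated as checks over the sequences of messages accepted by the fault-free processors. The following is only an illustrative sketch, not the patented method: the function names, the log format (one ordered list of accepted messages per processor), and the sample data are assumptions, and the bounded-time requirement is not modeled.

```python
def same_relative_order(a, b):
    """True if the messages common to both logs appear in the same order."""
    common = set(a) & set(b)
    return [m for m in a if m in common] == [m for m in b if m in common]


def check_accepted_logs(logs):
    """Check the acceptance properties over per-processor logs.

    `logs` maps each fault-free processor to its ordered list of accepted
    messages (hypothetical format chosen for this sketch).
    """
    seqs = list(logs.values())
    return {
        # No processor accepts the same message twice.
        "exactly_once": all(len(s) == len(set(s)) for s in seqs),
        # Every message is accepted by all processors or by none.
        "all_or_none": all(set(s) == set(seqs[0]) for s in seqs),
        # Any two processors accept their common messages in the same order.
        "same_order": all(
            same_relative_order(a, b) for a in seqs for b in seqs
        ),
    }


logs = {
    "p1": ["m1", "m2", "m3"],
    "p2": ["m1", "m2", "m3"],
    "p3": ["m1", "m3", "m2"],  # accepts m2 and m3 in a different order
}
print(check_accepted_logs(logs))
```

In this sample every processor accepts the same set of messages exactly once, but the ordering property fails because the third processor accepts two messages in a different order.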
U.S. Pat. No. 5,598,529 discloses a computer system resilient to a wide class of failures within a synchronized network. It includes a consensus protocol, a broadcast protocol, and a fault-tolerant computer system created by combining the two protocols. The protocols are subject to certain validity conditions. The system in the state of consensus is guaranteed to have all non-faulty processors in agreement as to what action the system should take. The system and protocols can tolerate up to t processor failures out of 3t+1 or more processors but, like the aforementioned method, they require timing guarantees and are therefore not suitable for asynchronous networks.
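The resilience bound of t failures out of 3t+1 or more processors can be made concrete with a short calculation. The helper names below are hypothetical and the sketch only restates the arithmetic of the bound.

```python
def max_tolerated_faults(n):
    """Largest t such that n >= 3t + 1."""
    if n < 1:
        raise ValueError("need at least one processor")
    return (n - 1) // 3


def min_processors(t):
    """Smallest number of processors that tolerates t failures."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


for n in (4, 6, 7, 10):
    print(n, "processors tolerate up to", max_tolerated_faults(n), "failures")
```

For example, four processors tolerate one failure, and adding a fifth or sixth processor does not raise that number; a seventh is needed to tolerate two.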
Fault-tolerant systems use computer programs called protocols to ensure that the systems will operate properly even if there are individual processor failures.
A fault-tolerant consensus protocol enables each processor or party to propose an action (via a signal) that is required to be coordinated with all other processors in the system. A fault-tolerant consensus protocol has as its purpose the reaching of a “consensus” on a common action (e.g., turning a switch off or on) to be taken by all non-faulty processors and ultimately the system. Consensus protocols are necessary because processors may send signals to only a single other processor at a time and a processor failure can cause two processors to disagree on the signal sent by a third failed processor. In spite of these difficulties, a fault-tolerant consensus protocol ensures that all non-faulty processors agree on a common action.
To reach consensus, consensus protocols first enable each processor or participating network device to propose an action (via a signal) that is later to be coordinated by all the processors or participating network devices in the system. The system then goes through the steps of the consensus protocol. After completing the consensus protocol steps, the common action of the consensus is determined.
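The difficulty that motivates consensus protocols, namely that a failed processor signalling one processor at a time can leave two other processors with different views of its signal, can be shown with a minimal sketch. The function `send_point_to_point`, the processor names, and the particular faulty behavior (alternating between two conflicting signals) are assumptions made purely for illustration.

```python
def send_point_to_point(sender_faulty, value, receivers):
    """Return the signal each receiver sees from one sender.

    A correct sender delivers the same value to everyone. The faulty sender
    modeled here (hypothetical behavior) alternates between "on" and "off".
    """
    if not sender_faulty:
        return {r: value for r in receivers}
    return {r: ("on" if i % 2 == 0 else "off") for i, r in enumerate(receivers)}


received = send_point_to_point(True, "on", ["p1", "p2"])
disagree = len(set(received.values())) > 1
print(received, "disagreement:", disagree)
```

Without further protocol steps, the two receivers in this sketch cannot tell which of them holds the signal that the sender actually intended, which is the situation a fault-tolerant consensus protocol has to resolve.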
A safe architecture for distributing trusted services among a set of servers is desired that guarantees availability and integrity of the services despite some servers being under control of an attacker or failing in arbitrary malicious ways. The architecture should be characterized by a static set of servers and completely asynchronous point-to-point communication. Trusted applications can only be achieved by an efficient and provably secure agreement and broadcast protocol.