Synchronization and coordination algorithms are part of distributed computer systems. Clock synchronization algorithms are essential for managing the use of resources and controlling communication in a distributed system. In addition, a fundamental criterion in the design of a robust distributed system is the capability to tolerate, and potentially recover from, failures that are not predictable in advance. Such failures are most suitably addressed by tolerating Byzantine faults. A Byzantine fault is an arbitrary fault that occurs during the execution of an algorithm by a distributed system. It encompasses those faults that are commonly referred to as “crash failures” and “send and omission failures.” When a Byzantine failure has occurred, the system may respond in any unpredictable way unless it is designed to have Byzantine fault tolerance. The objective of Byzantine fault tolerance is to defend against a Byzantine failure, in which a component of some system not only behaves erroneously but also fails to behave consistently when interacting with multiple other components. Correctly functioning components of a Byzantine-fault-tolerant system are able to reach the same group decisions regardless of the Byzantine-faulty components.
There are, however, upper bounds on the fraction of traitorous or unreliable components that can be tolerated. A Byzantine-fault model encompasses all unexpected failures, including transient ones, within the limit on the maximum number of faults present at a given time. A distributed system tolerating as many as F Byzantine faults requires a network size of more than 3F nodes. Byzantine agreement cannot be achieved with fewer than 3F+1 nodes, and at least 3F+1 nodes are likewise necessary for clock synchronization in the presence of F Byzantine faults.
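The node-count bound above can be illustrated with a short arithmetic sketch. The function names `min_nodes_for_faults` and `max_tolerable_faults` are hypothetical and chosen only for this illustration; the code simply restates the 3F+1 relationship and is not part of any particular protocol.

```python
def min_nodes_for_faults(f: int) -> int:
    """Smallest network size that can tolerate f Byzantine faults (3F + 1)."""
    if f < 0:
        raise ValueError("number of faults must be non-negative")
    return 3 * f + 1


def max_tolerable_faults(n: int) -> int:
    """Largest F such that n >= 3F + 1, i.e. floor((n - 1) / 3)."""
    if n < 1:
        raise ValueError("network must contain at least one node")
    return (n - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(f"{n} nodes tolerate up to {max_tolerable_faults(n)} Byzantine fault(s)")
```

For example, a four-node network tolerates one Byzantine fault, whereas a three-node network tolerates none.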
A distributed system is defined to be self-stabilizing if, from an arbitrary state and in the presence of a bounded number of Byzantine faults, it is guaranteed to reach a legitimate state in a finite amount of time and to remain in a legitimate state as long as the number of Byzantine faults stays within a specific bound. A legitimate state is a state in which all good clocks in the system are synchronized within a given precision bound. Therefore, a self-stabilizing system is able to start in an arbitrary state and to recover from transient failures after the faults dissipate.
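The definition of a legitimate state can be sketched as a simple predicate. This is a minimal illustration under stated assumptions: clock values are plain integers that do not wrap around, the set of good (non-faulty) clocks is already known to the caller, and the name `is_legitimate_state` is hypothetical. A real protocol cannot identify the good nodes directly and would typically use bounded, wrapping counters.

```python
from typing import Sequence


def is_legitimate_state(good_clocks: Sequence[int], precision: int) -> bool:
    """Return True if all good clocks lie within `precision` of one another.

    Assumption: clock values are non-wrapping integers, and `good_clocks`
    contains only the readings of non-faulty nodes.
    """
    if precision < 0:
        raise ValueError("precision bound must be non-negative")
    if not good_clocks:
        raise ValueError("at least one good clock is required")
    # The spread between the fastest and slowest good clock must not
    # exceed the precision bound.
    return max(good_clocks) - min(good_clocks) <= precision


if __name__ == "__main__":
    print(is_legitimate_state([100, 101, 102], precision=2))
    print(is_legitimate_state([100, 101, 250], precision=2))
```

In the usage lines, the first call describes clocks whose spread equals the bound and is therefore legitimate; the second describes an arbitrary starting state from which a self-stabilizing system would have to converge.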
Known algorithms that address permanent faults either ignore the issue of transient failures or address it inadequately. Known efficient Byzantine clock synchronization algorithms are based on assumptions of initial synchrony of the nodes or of the existence of a common pulse at the nodes. Other known clock synchronization algorithms are based on randomization and are therefore non-deterministic. Some known clock synchronization algorithms have provisions for initialization and/or reintegration. However, solving these special cases is insufficient to make an algorithm self-stabilizing. A self-stabilizing algorithm encompasses these special scenarios without having to address them separately. The main challenges associated with self-stabilization are the complexity of the design and the proof of correctness of the protocol. Another difficulty is achieving an efficient convergence time for the proposed self-stabilizing protocol.