For the sake of safety and reliability, critical computing tasks often rely on redundant components or systems based on highly accurate, fault-tolerant timing schemes, so that the performance of system components can be compared and, if warranted, a deviating component removed from service. Such timing schemes are especially important for processors that must be synchronized for voting or that integrate or differentiate sampled inputs with respect to time; these systems rely on an accurate clock to keep their output accurate. Inertial navigation systems are a typical example of a system that must integrate and differentiate input signals accurately so that position may be properly maintained.
An exercise in logic analysis known as the Byzantine Generals' Problem establishes the concept of a Byzantine fault, which many prior fault-tolerant schemes have attempted to mitigate. A Byzantine fault is any fault that presents different symptoms to different observers. A Byzantine Generals' Problem is simply any Byzantine fault that can lead to a system failure. The classic exercise shows that if there are N generals operating to defeat an enemy, more than two thirds of the N generals must be loyal to guarantee that the loyal generals can reach agreement on a plan of battle. By analogy, a single clock channel failure can prevent two other clock channels from being correctly synchronized. Thus, at least four clock fault containment regions are required to tolerate a single Byzantine fault. In general, for F Byzantine faults to be tolerated, the system needs at least 3F+1 fault containment regions and F+1 rounds of communication.
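The 3F+1 bound stated above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are hypothetical, chosen only for illustration, and the sketch covers the counting argument alone, not the exchange of messages between regions.

```python
def min_fault_containment_regions(faults: int) -> int:
    """Smallest number of fault containment regions that can tolerate
    `faults` simultaneous Byzantine faults, using the 3F + 1 bound."""
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 3 * faults + 1


def max_tolerable_faults(regions: int) -> int:
    """Largest F for which `regions` >= 3F + 1 still holds."""
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return (regions - 1) // 3
```

For example, max_tolerable_faults(3) evaluates to 0 and max_tolerable_faults(4) evaluates to 1, which matches the statement that three clock channels cannot tolerate a single Byzantine fault while four can.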
The prior art in digital clock fault tolerance is diverse and attempts to solve many problems. Some of its teachings fail to recognize all of the problems, or address them only incompletely. None of the prior art solves all of the problems to the degree that the present invention does. The prior art exhibits one or more of the following main problems:
(a) Insufficient Accuracy--It has been recognized in much of the prior art that delays in the voting and exchange circuitry of a fault-tolerant clock add to inaccuracy. What has been under-recognized is that the integration used in most phase locked loops (PLLs) in various fault-tolerant clock designs can have an even larger impact on accuracy. For example, in the often-cited paper "Fault-tolerant Clocking System," published in the digest of papers for the 1991 International Symposium on Fault-Tolerant Computing, the author presents an analysis of the phase locked loops as used in the related U.S. Pat. No. 4,239,982 and also comments on the adverse effects of delay on accuracy, but fails to recognize the adverse effects on accuracy of the integrators (low pass filters) in the PLLs. The paper "Achievable Performance of Fault-tolerant Avionics Clocks" by Krause, Englehart, and Shaner for the 8th Computing in Aerospace Conference, 1991, details the adverse effects of integrators in cross-strapped PLLs as commonly used in fault-tolerant clocks. This finding is counter-intuitive to most practitioners of the art. Another counter-intuitive finding is that accuracy is maximized when the voting rate is minimized. These concepts are addressed by this invention.
(b) Not Digital--Most fault-tolerant clock designs of the prior art have some analog components which directly affect the accuracy and fault tolerance of the clocking mechanism. Analog components have tolerances, changes with age, failure modes, and fault propagation modalities that make them hard to reason about formally, so that mathematical proofs of correctness are difficult to apply. With current digital technology, digital components are becoming very dense and cheap, and have standardized physical dimensions. Using analog components in an otherwise digital system adds a disproportionate cost to the system.
(c) Use Naive Fault Tolerance Assumptions--Most fault-tolerant clock designs assume that failures are only of the simple "stuck-at" or too-fast/too-slow types. Some have begun to recognize, but do not fully understand, Byzantine faults. Only a few include provisions for over-voltage and similar faults. Most ignore the problems of metastability-induced errors and do not consider all possible start-up scenarios. Metastability and start-up pathologies can cause many so-called fault-tolerant clocks to fail, even with no component failures. Some start-up pathologies cannot be avoided entirely, but can be minimized if understood. As another example, U.S. Pat. No. 4,239,982 requires an even number of clock sources. This can be seen as diametrically opposed to best practice when one considers the possibility of one half of the clock sources starting exactly in phase with each other and exactly 180 degrees out of phase with the other half. There is no way of guaranteeing that this condition will not persist indefinitely. An odd number of clocks, on the other hand, cannot get into this half-versus-half situation.
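The half-versus-half start-up condition described in item (c) can be illustrated with a deliberately simplified model. The sketch below is in Python and rests on assumptions not found in the cited patent: each clock starts in one of only two phases (0 or 180 degrees), and agreement is decided by a strict majority vote. The function names are hypothetical, and this is not the mechanism of any cited design.

```python
from collections import Counter
from itertools import product


def majority_phase(phases):
    """Return the phase held by a strict majority of the clocks,
    or None when no phase has more than half of the votes."""
    counts = Counter(phases)
    phase, votes = counts.most_common(1)[0]
    return phase if 2 * votes > len(phases) else None


def can_deadlock(num_clocks):
    """True if some assignment of the two assumed start-up phases
    (0 or 180 degrees) leaves the clocks without a strict majority."""
    return any(
        majority_phase(assignment) is None
        for assignment in product((0, 180), repeat=num_clocks)
    )
```

Under these assumptions, can_deadlock(4) is true because the assignment (0, 0, 180, 180) has no majority, whereas can_deadlock(5) is false because an odd number of clocks split between two phases always leaves one phase with more than half of the votes.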
Fault-tolerant clock designs can be divided into the following groups based on the means they use to cross-couple their redundant channels:
(a) Coupling of pulses directly into the feedback path of the oscillators--This can be done with simple analog components, but doing so incurs the problems of analog components stated above; in particular, it is extremely difficult to guarantee, to a high confidence, that all fault effects are contained within their respective fault containment regions. This coupling can also be done digitally with voting. In this case, inaccuracy is maximized because delay errors are injected at the highest possible rate.
(b) Primary with backup(s)--These schemes can be highly accurate because feedback errors do not contribute to inaccuracy. However, they are not very fault tolerant. It is difficult to create a fault detection mechanism that has guaranteed high coverage and can cause a switch from a faulty primary to a good backup before the faulty primary signal causes harm to the rest of the system. Most of these schemes do not address system failures caused by Byzantine faults.
(c) Peer, cross-coupled digital signals--This group has the best possibility of meeting all the requirements of an accurate, truly fault-tolerant clock.
The present invention belongs to the last of these groups.