A commercial aircraft or spacecraft typically includes many redundant systems to ensure the safety of the passengers in the event that a critical component in one of the systems fails. For example, the avionics system on such a craft can include three or more redundant computers and supporting hardware, each computer being connected to process the same data from a common source, or redundant data from different sources monitoring the same parameter, in order to produce an output that is consistent in the presence of faults. Should a fault cause an error in any of the data so that a different result is produced by one of the redundant computers, the avionics system uses the result on which the other two computers agree. The step of selecting data for use in a process from among the outputs of redundant sources is called "voting." Although noise or other transient occurrences can produce one-time disagreements between redundant data sources that do not have a long-term adverse impact on a system, continuing disagreements usually indicate a failed component or a break in communications on one of the channels, e.g., due to an intermittent connector that causes one bit in the transmitted data to differ on one channel from its value on all other channels. Accordingly, fault tolerant systems can be configured to lock out one of the redundant channels if it continues to produce results that vary from those of the other channels.
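The voting and lock-out behavior described above can be illustrated with a minimal sketch. It is not taken from any cited system: the class name `Voter`, the per-channel miscompare counter, and the threshold of three consecutive disagreements are all assumptions chosen for illustration.

```python
from collections import Counter


class Voter:
    """Hypothetical majority voter that locks out a persistently disagreeing channel."""

    def __init__(self, n_channels=3, lockout_threshold=3):
        # lockout_threshold is an assumed tuning value, not one given in the text.
        self.lockout_threshold = lockout_threshold
        self.miscompares = [0] * n_channels      # consecutive disagreements per channel
        self.locked_out = [False] * n_channels

    def vote(self, values):
        """Return the majority value among the channels that are not locked out."""
        active = [i for i, locked in enumerate(self.locked_out) if not locked]
        counts = Counter(values[i] for i in active)
        winner, votes = counts.most_common(1)[0]
        if 2 * votes <= len(active):
            raise ValueError("no majority among active channels")
        for i in active:
            if values[i] == winner:
                # A transient disagreement is forgiven once the channel agrees again.
                self.miscompares[i] = 0
            else:
                self.miscompares[i] += 1
                if self.miscompares[i] >= self.lockout_threshold:
                    self.locked_out[i] = True
        return winner


voter = Voter()
voter.vote([7, 7, 9])            # one-time disagreement: channel 2 is outvoted, not locked out
voter.vote([7, 7, 7])            # agreement resets the counter for channel 2
for _ in range(3):
    voter.vote([8, 8, 9])        # continuing disagreement on channel 2
print(voter.locked_out)
```

Resetting the counter on agreement is one possible way to separate transient noise from a failed component; a real system might instead use a windowed count or a different recovery policy.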
A truly fault tolerant system employs redundancy at multiple levels. For example, in addition to including redundant computers for processing data, a fault tolerant system should have redundant sensors to monitor critical parameters. Redundant clocks are also often provided to ensure that the computers operate on a consistent time base, and a fault tolerant method is then required for selecting the clock signal or time base used by the system. In commonly assigned U.S. Pat. No. 4,979,191, a fault tolerant clock system is disclosed that carries out this function in a novel manner. The fault tolerant clock system disclosed therein is "Byzantine resilient," because it uses a method that addresses the classical exercise in logic known as the Byzantine Generals' Problem.
In the Byzantine Generals' Problem, a city is surrounded by the Byzantine Army, separate divisions of which are each controlled by one of N different generals. Communication between the generals is limited to oral messages carried by runners. One or more of the N generals may be a traitor who will attempt to confuse the other generals by sending false messages. For the simple case in which there are only three generals, it has been shown that a single traitor can confuse two loyal generals, leading to the theorem that more than two-thirds of the N generals must be loyal to guarantee that the loyal generals can properly reach agreement on a plan of battle.
By analogy to this classic problem, a single clock channel in which a fault appears can prevent two other clock channels from being correctly synchronized if the fault causes a different time base signal to be conveyed to each of the properly operating clock channels during an attempted synchronization process. Based on this theorem, it would appear that at least four redundant clock channels are required in a clock system in order to tolerate a single fault. However, in U.S. Pat. No. 4,979,191, the fault tolerant clock system disclosed achieves Byzantine resilience by using only three clock channels, each having two fault containment regions. Thus, with only three channels, a single fault can be tolerated (a plurality of faults in a given fault containment region are considered equivalent to a single fault).
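The arithmetic behind the "at least four channels" conclusion can be checked with a short sketch. The function names are hypothetical, and the sketch covers only the classical oral-message bound; it does not model the fault containment regions used in the three-channel clock system of U.S. Pat. No. 4,979,191.

```python
def agreement_guaranteed(n: int, faults: int) -> bool:
    """True when more than two-thirds of n participants are non-faulty."""
    # (n - faults) / n > 2/3 is rewritten in integers as 3 * (n - faults) > 2 * n.
    return 3 * (n - faults) > 2 * n


def min_channels(faults: int) -> int:
    """Smallest n satisfying the two-thirds condition, i.e. n > 3 * faults."""
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 3 * faults + 1


print(agreement_guaranteed(3, 1))   # three channels, one fault
print(agreement_guaranteed(4, 1))   # four channels, one fault
print(min_channels(1))
```

With three participants and one fault, exactly two-thirds are non-faulty, which does not satisfy the strict inequality; four is the smallest count that does.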
The Byzantine resilience requirement also extends to other aspects of a fault tolerant system, including the exchange of data over redundant channels. The data may comprise sensor input signals or the output signals produced by redundant computers in response to input signals that are nominally the same. A data exchange unit (DEU) should be incorporated into the system for each channel to ensure that non-faulty, redundant computing channels receive identical input data and transmit identical output data. A voting function is required in order to accommodate faults in the data interchange, since the data being exchanged on the redundant channels can differ between channels.
Others have attempted to provide solutions to this problem. For example, U.S. Pat. No. 4,101,958 discloses apparatus and a method for transferring redundant control data in an aircraft digital flight control system. To control data exchange among redundant computation channels, the system includes a register into which a "tic" is entered when selected data are written into a main memory. The position of the tic in the register corresponds positionally to the address of the main memory at which the data are being written. The tic-containing register is searched, and when the tic is found, the data are retrieved from the main memory, multiplexed in sets with raw sensor data, and transmitted to the other redundant channels in serial format. However, this approach is not Byzantine resilient, because it cannot be proven immune to any single point failure. Also, the voting is done by the central processing unit (CPU), using direct memory access (DMA) to move the data into the CPU, and the sensor data are interleaved with computed data. This approach increases the computational overhead on the CPU and the likelihood of communication errors.
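The tic register mechanism described above can be pictured with the following rough sketch. It is an illustration only and not the implementation of U.S. Pat. No. 4,101,958: the names `TicRegister`, `write_selected`, and `collect_for_transfer` are hypothetical, the one-bit-per-address layout and the clearing of tics after the search are assumptions, and the multiplexing with raw sensor data and the serial transmission are omitted.

```python
class TicRegister:
    """Hypothetical register holding one tic (bit) per main-memory address."""

    def __init__(self, size):
        self.size = size
        self.bits = 0

    def mark(self, address):
        """Enter a tic at the position corresponding to the written address."""
        if not 0 <= address < self.size:
            raise IndexError("address outside register range")
        self.bits |= 1 << address

    def drain(self):
        """Search the register, return marked addresses in ascending order, clear them."""
        found = [a for a in range(self.size) if (self.bits >> a) & 1]
        self.bits = 0   # assumed behavior: tics are cleared once found
        return found


def write_selected(memory, tics, address, value):
    """Write selected data to main memory and record a tic for that address."""
    memory[address] = value
    tics.mark(address)


def collect_for_transfer(memory, tics):
    """Retrieve the data at every marked address, ready to send to other channels."""
    return [(a, memory[a]) for a in tics.drain()]


memory = [0] * 8
tics = TicRegister(len(memory))
write_selected(memory, tics, 5, 0xBEEF)
write_selected(memory, tics, 2, 0x0042)
print(collect_for_transfer(memory, tics))
```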
Ideally, a data exchange unit (DEU) should support multiple levels of data management system redundancy, e.g., from one to four channels, without any hardware changes to the unit. The DEU should also provide a mechanism for inter-channel communications, to ensure that the data management system can initialize in the presence of any single fault. None of the prior art data exchange methods or apparatus includes these capabilities.