1. Field of the Invention
The invention relates generally to fault-tolerant computer systems, and more particularly to processes for assuring that the computational elements in multi-computer fault-tolerant computer systems start with the same data base.
2. Description of the Prior Art
Computer systems for use in applications requiring extreme reliability can be developed through either of two basic approaches. One approach is to build the system to be fault-resistant; that is, such that each element of the system is unlikely to fail. The other approach is to build the system to be fault-tolerant. The latter approach combines redundant components with a method of selecting which of their results are accepted, so that some components within the system can fail and the system will still produce the proper result. A number of articles discussing various aspects of fault-tolerant computer systems appear in Proceedings of the IEEE, Volume 66, No. 10, October, 1978. Fault-tolerant computer systems which begin with a single data source and utilize multiple computational elements operate on the principle that each identical computational element, starting with the same data and implementing the same program, will produce the same result unless a fault is present in the system. The common approach to such a system places circuitry on the outputs of the multiple computational elements which selects, as the output for the system, an output that is consistent with a majority of the computational elements.
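The majority selection described above is performed by output circuitry in the systems discussed; purely as an illustration, it can be sketched in software as follows. The function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions made for this sketch and are not taken from the prior art.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of elements, else None.

    Hypothetical helper: models the output-selection step in software.
    """
    if not outputs:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three redundant elements, one of which has produced a faulty output.
print(majority_vote([42, 42, 17]))
# No two elements agree, so no majority output can be selected.
print(majority_vote([1, 2, 3]))
```

With three elements, a single faulty output is outvoted by the two that agree, which is the sense in which such a system tolerates one fault.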
In applications where the source of the computer system's input data is also subject to faults, redundancy in that area can also be utilized. So long as the redundant input sources produce identical data unless one of them is faulty, such a system will operate in the same manner as that described above. In some applications, however, there can be slight variations in the input data between the various input sources without any of the data necessarily being "wrong." This is a common problem in applications where the computer system is utilized to control a process and the input data involves the measurement of a physical property of a continuous nature, such as temperature or pressure. Frequently, an analog transducer is used in the sensing circuitry and its output is converted to digital data, with slight variations resulting among the various transducer or converter outputs.
In a computer system employing multiple redundant computational elements, it is necessary that all of the computational elements utilize the same input data. Further, if the system is intended to tolerate one of the computational elements becoming faulty, then all the non-faulty computational elements should utilize the same input data, regardless of the behavior of the faulty one. It can be shown that for prior art systems, one computational element can become faulty in such a way that the non-faulty computational elements will not utilize the same input data and, therefore, will not necessarily produce the same results. The particular use of three redundant computational elements has been studied extensively in the literature under the general topics of "Achieving Interactive Consistency" and "The Byzantine Generals Problem".
The general problem discussed in the literature can be illustrated with a particular example. Assume three computational channels (CCA, CCB and CCC). Each reads some physical property, such as flow or temperature, and they receive slightly differing data, such as 372, 374 and 376 respectively, because of the inherent small differences between their analog converter devices. In order that the three computational channels all utilize the same input, each of them communicates its view of the data to the others. Each of them then uses the same method, such as taking the average or the middle value, to select the value to be used in subsequent calculations. By this means, all computational channels will carry out the same calculation and arrive at the same results. Consider now that CCA becomes faulty and does not communicate the same data to CCB and CCC; specifically, instead of its own data (372 in this example), it communicates 374 to CCB and 378 to CCC. CCB and CCC each properly communicate their data (374 and 376, respectively) to the other two computational channels.
The three computational channels now possess the following views of the data:
CCA has 372 as its own data and 374 and 376 from the others
CCB has 374 as its own data and 374 and 376 from the others
CCC has 376 as its own data and 374 and 378 from the others
Each computational channel now applies the selection algorithm, but because their sets of input data differ, the channels can select differing values, and computations that rely on those values can produce different results. This defeats the attempt to have the system tolerate a fault without its results being affected. Other examples of the problem of achieving interactive consistency occur in modular redundancy schemes and have been discussed in the literature.
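The three-channel example can be reproduced with a short sketch. The helper names `exchange` and `select` and the `lies` mapping are hypothetical, introduced only for illustration, and the middle value (median) is assumed as the selection method; the readings and the faulty messages are those given in the example.

```python
from statistics import median

# Each channel's own reading, as given in the example.
readings = {"CCA": 372, "CCB": 374, "CCC": 376}


def exchange(readings, lies=None):
    """Build each channel's view: its own value plus what each peer sent it.

    `lies` (hypothetical) maps (sender, receiver) to the value a faulty
    sender reports to that receiver in place of its true reading.
    """
    lies = lies or {}
    views = {}
    for receiver in readings:
        view = []
        for sender, value in readings.items():
            if sender == receiver:
                view.append(value)  # a channel always trusts its own reading
            else:
                view.append(lies.get((sender, receiver), value))
        views[receiver] = sorted(view)
    return views


def select(views):
    """Apply the same selection rule (middle value) in every channel."""
    return {channel: median(view) for channel, view in views.items()}


# Fault-free case: every channel holds the same three values.
fault_free = select(exchange(readings))

# CCA is faulty: it reports 374 to CCB and 378 to CCC.
faulty = select(exchange(readings, {("CCA", "CCB"): 374, ("CCA", "CCC"): 378}))

print(fault_free)  # all three channels select 374
print(faulty)      # CCB selects 374 while CCC selects 376
```

In the faulty case the two non-faulty channels, CCB and CCC, select different values even though each applied the same rule correctly, which is the failure of interactive consistency described above.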