This invention was made with Government support under contract NAS1-18565 awarded by NASA. The Government has certain rights in this invention.
Byzantine resilient data processing systems have been described in the art, one such system being described, for example, in U.S. Pat. No. 4,907,232, issued to R. Harper and J. Lala on Mar. 6, 1990. In such a system, groups of redundant components are utilized in separate fault containment regions (FCRs), sometimes also referred to as fault containment channels or lanes, which regions are electronically and physically isolated from each other. Thus, the failure of a component within one FCR may cause the failure of other components in the same FCR, but such failure cannot induce faults in the other FCRs. Moreover, erroneous behavior in one FCR cannot cause the aggregate of FCRs to exhibit erroneous behavior.
Error propagation can occur, however, in such a system when a faulty FCR transmits its faulty data to another FCR. If a functioning FCR that receives such data does not react to it in the same way as the other functioning FCRs, the recipient FCR may itself appear faulty. Fault masking techniques, i.e., data voting techniques, are used to prevent faulty data from degrading the operation of one or more functioning FCRs: an FCR receives redundant data from the other FCRs and applies such fault masking techniques thereto to mask a given number of erroneous data items. Accordingly, such a system maintains its functionality even in the presence of Byzantine failures.
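By way of illustration only, the fault masking described above can be sketched as a simple majority voter. The function name `majority_vote` and the use of `None` as the result when no majority exists are assumptions made for this sketch; they are not taken from the referenced patent.

```python
from collections import Counter


def majority_vote(copies):
    """Return the value held by a strict majority of the redundant copies.

    Returns None (an assumed convention for this sketch) when no value
    is held by more than half of the copies.
    """
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) // 2 else None


# Three redundant copies of a data item, one of them erroneous:
# the erroneous item is masked by the two agreeing copies.
masked = majority_vote([42, 42, 17])
```

With three copies, a single erroneous data item is outvoted by the two agreeing copies, so the recipient FCR proceeds with the same value as the other functioning FCRs.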
It has been found that such a Byzantine resilient system can tolerate the loss of F fault containment regions under the following conditions: (1) if (3F+1) FCRs are utilized, (2) if each FCR is connected to at least (2F+1) other FCRs by disjoint communication links or paths, (3) if (F+1) rounds of data exchange are used to distribute single-source data, and (4) if the operations of the functioning FCRs of the system are time synchronized to within a known and specified time skew. Thus, for a single-fault tolerant system (F=1), four FCRs are utilized, each FCR is connected to each of the other three FCRs, and two rounds of data exchange are used to distribute single-source data.
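The conditions above can be illustrated with a short sketch for the case F=1. The sketch computes conditions (1) through (3) and simulates a two-round exchange of single-source data among one source FCR and three recipient FCRs. All names (`requirements`, `vote`, `two_round_exchange`, the FCR labels "B", "C", "D") and the agreed default of `None` are hypothetical. Time synchronization, condition (4), is assumed rather than modeled, and the sketch does not represent the implementation of any particular system.

```python
from collections import Counter

DEFAULT = None  # assumed default adopted when the vote finds no majority


def requirements(f):
    """Conditions (1)-(3) for tolerating f faulty FCRs."""
    return {"fcrs": 3 * f + 1, "links_per_fcr": 2 * f + 1, "rounds": f + 1}


def vote(copies):
    """Strict-majority vote over the copies, else the agreed default."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) // 2 else DEFAULT


def two_round_exchange(sent, relay_fault=None):
    """Simulate the F=1 exchange among the three recipient FCRs.

    sent:        dict mapping each recipient to the value the source
                 sent it in round 1 (a faulty source may send
                 different values to different recipients).
    relay_fault: optional (recipient, bogus_value) pair; that recipient
                 relays bogus_value instead of what it received.
    Returns a dict mapping each recipient to its voted value.
    """
    results = {}
    for r in sorted(sent):
        copies = [sent[r]]  # round 1: copy received directly from source
        for other in sorted(sent):
            if other == r:
                continue
            # round 2: copy relayed by each of the other recipients
            if relay_fault and other == relay_fault[0]:
                copies.append(relay_fault[1])
            else:
                copies.append(sent[other])
        results[r] = vote(copies)
    return results


# Faulty recipient D relays a bogus value; B and C still obtain 7.
faulty_relay = two_round_exchange({"B": 7, "C": 7, "D": 7}, ("D", 99))

# Faulty source sends inconsistent values; all recipients still agree.
faulty_source = two_round_exchange({"B": 1, "C": 1, "D": 2})
```

In both cases the functioning FCRs end the second round holding the same value, which is the property the (F+1) rounds of exchange are intended to secure for single-source data.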
As the demands placed on such systems grow, owing to desired increases in their functionality and operating system requirements, it has been found that the memory size requirements of such systems also increase significantly. Increased memory size adversely affects both the reliability and the cost of the system. It also reduces the ability of the system to reintegrate FCRs into system use after a transient fault occurs therein, since the time required to reintegrate an FCR is normally dominated by memory realignment time.
It is desirable to provide a Byzantine resilient fault tolerant system in which memory size requirements can be reduced as compared with those of currently proposed systems. It is especially important that such a system remain ultra-reliable with respect to single-source data, sometimes referred to as the "source-congruency" requirement, in order to maintain Byzantine resilience.