In many systems which process information, particularly systems used for making critical decisions in real time, it is necessary that such systems have a high degree of reliability, such that the allowable probability of a failure per unit time is extremely low. Such systems, as may be used in aircraft, space vehicles, medical applications, and the like, also demand a substantially high level of processing performance. Such performance includes not only the provision of a high data throughput and large memory capability, but also the ability to satisfy whatever unique requirements are imposed by the real time operational environment. Thus, the processing architecture must be designed to be capable of adapting itself to the varying requirements of the task being performed in real time. Conventional redundant processing systems which can normally be used for many applications often do not have a sufficient degree of reliability to be safely used in highly critical applications, and it is desirable to provide new approaches to the problem of fault tolerance, particularly where more than one fault may have to be tolerated, e.g., in systems where a fault which arises cannot be corrected before another fault arises. While such conditions may be relatively rare, in critical applications the existence of such conditions, if not fully corrected, may give rise to extremely costly and even life-threatening malfunctions of the overall system.
Further, a conservative, but not unrealistic, model of failure behavior is to consider failures which may arise because of arbitrary behavior on the part of one or more failed components. Such failures are often referred to as Byzantine faults or as giving rise to "malicious" errors. Malicious errors or Byzantine failures may occur without detection, and many systems are simply incapable of detecting and correcting for such faults.
It has been determined by those in the art that certain requirements must be met in order to avoid not only normal faults but also such Byzantine or malicious faults. Such requirements can be summarized as the following criteria in which the term "f" represents the number of faults which are to be simultaneously handled:
1. There must be at least (3f+1) redundant processing participants (i.e., processing elements) for the algorithm which is being implemented, each participant residing in a different fault containment region.
2. Each participant must be connected to each other participant through at least (2f+1) disjoint communication paths.
3. There must be a minimum of (f+1) rounds of communication among the participants in the execution of the algorithm.
4. The participants must be appropriately time synchronized in their operations with respect to one another to within a known time skew of each other.
5. In order to mask simultaneous faults by voting of the output of redundant executions in the computation, (2f+1) executions, i.e., (2f+1) processing elements, are required to guarantee that a majority of non-faulty executions exist.
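The numerical criteria above can be tabulated for a given number of simultaneous faults f. The following is a minimal illustrative sketch only; the names `ByzantineRequirements` and `byzantine_requirements` are hypothetical and do not come from the cited art. The time synchronization criterion is not modeled because no numerical bound on the skew is stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByzantineRequirements:
    """Minimum resources needed to tolerate f simultaneous Byzantine faults."""
    faults: int             # f
    participants: int       # criterion 1: 3f + 1 processing elements
    disjoint_paths: int     # criterion 2: 2f + 1 paths between each pair
    rounds: int             # criterion 3: f + 1 rounds of communication
    voting_executions: int  # criterion 5: 2f + 1 executions for majority voting


def byzantine_requirements(f: int) -> ByzantineRequirements:
    """Evaluate the stated criteria for a non-negative fault count f."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return ByzantineRequirements(
        faults=f,
        participants=3 * f + 1,
        disjoint_paths=2 * f + 1,
        rounds=f + 1,
        voting_executions=2 * f + 1,
    )


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(byzantine_requirements(f))
```

For f = 1, for example, this yields four participants, three disjoint paths between each pair, two rounds of communication, and three executions for voting.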
A system which satisfies such criteria and is thereby capable of executing the appropriate algorithm is often called an "f-Byzantine resilient" system. Currently available high-throughput systems do not appear to meet all of the above requirements and it is desirable to provide a system which can do so.
One such approach to providing a Byzantine resilient processing system has been described in the article: "Advanced Information Processing System" by J. H. Lala, AIAA/IEEE, 6th Digital Avionics Conference, Baltimore, Md., Dec. 3-6, 1984. In the systems described therein, for example, the theoretical requisite regions of connectivity between them are provided between members of a single redundant processing site composed of at least (3f+1) processing elements. Such minimally-sized processing sites are then connected to each other to form a parallel processing ensemble using inter-group links which do not possess the requisite (2f+1) connectivity. In such a technique, members of different redundant groups cannot be grouped together to form new redundant groups as members of a former group fail. For example, if a processing element in a first redundant processing site and another processing element in a second redundant processing site were to fail, inadequate connectivity exists between the surviving members of both processing sites to permit the formation of another redundant processing site. Thus, the significant potential for attrition resiliency and graceful degradation of performance in the operation of the system cannot be achieved.
Another approach is shown in the article: "SIFT, Design and Analysis of a Fault-Tolerant Computer for Aircraft Control" by J. H. Wensley et al., Proc. of the IEEE, Vol. 66, No. 10, October 1978, in which each processor element of the system is connected to each other processor element of the system in a fully-connected overall network. Any processor element can then be grouped with any other processor element to form a redundant processing site. As long as there are (3f+1) non-faulty processing elements, an f-fault masking redundant processing site may be configured. However, the cost and complexity of providing full interprocessor connectivity increases quadratically with the size of the system, i.e., the number of processing elements in the system, thereby rendering such an approach far too costly in terms of the number of communication links and ports required for a relatively large size system. In addition, the processing elements comprising a redundant processing site are responsible for the fault tolerance specific functions, whose execution consumes a significant portion of the computational throughput, thereby reducing the overall information throughput which is desired.
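The quadratic growth noted above can be illustrated with a short calculation. This is a sketch under stated assumptions, not a description of the SIFT hardware: it assumes one bidirectional point-to-point link per pair of processing elements and one port per link end, and the function names are hypothetical.

```python
def full_mesh_links(n: int) -> int:
    """Links needed to connect every pair of n elements.

    Assumption: one bidirectional point-to-point link per pair,
    giving n * (n - 1) / 2 links.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def ports_per_element(n: int) -> int:
    """Ports each element needs to reach all others (assumed one per link end)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return max(n - 1, 0)


def formable_sites(non_faulty: int, f: int) -> int:
    """Disjoint f-fault masking sites that (3f+1)-element grouping permits."""
    if non_faulty < 0 or f < 0:
        raise ValueError("arguments must be non-negative")
    return non_faulty // (3 * f + 1)


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, full_mesh_links(n), ports_per_element(n), formable_sites(n, 1))
```

Under these assumptions, growing the system from 16 to 64 processing elements raises the link count from 120 to 2016, while the number of single-fault masking sites that can be formed rises only from 4 to 16.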
Accordingly, it is desirable to design a new Byzantine resilient fault-tolerant system which avoids the disadvantages of the above approaches and provides attrition resiliency and graceful performance degradation at a reasonable complexity and cost.