Redundant communication channels are fundamental to the operation of computer systems where high reliability is essential. For example, digital flight control systems of modern aircraft typically utilize three independent digital computers operating redundantly. Each computer performs a set of calculations and compares its results with those of the other two. Interprocessor communications are conducted over three independent message channels so that no two computers send information on the same channel.
With three independent computational results, each of the redundant computers can perform consistency checks to ensure that the computations are identical. If one of the computers has calculated different results, a hardware or software fault has occurred within some part of the system. In the event of a fault, a two-out-of-three majority "vote" determines which computational results are to be accepted as correct. In addition, discrepancies between the redundant computations can be used to determine within which of the redundant channels the fault has occurred. A faulty channel can then be electronically isolated to prevent further use of its erroneous computations.
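The two-out-of-three vote and fault localization described above can be illustrated with a minimal sketch. The function name `vote_2oo3` and the channel labels are hypothetical and chosen only for illustration; an actual flight control system would compare results with tolerances and timing constraints that are not modeled here.

```python
from collections import Counter

def vote_2oo3(results):
    """Majority-vote three redundant results.

    results: dict mapping a channel label to that channel's computed value.
    Returns (accepted_value, faulty_channels). If no two channels agree,
    accepted_value is None and every channel is reported as suspect.
    """
    if len(results) != 3:
        raise ValueError("expected exactly three channel results")
    # Count how many channels produced each distinct value.
    value, count = Counter(results.values()).most_common(1)[0]
    if count < 2:
        return None, sorted(results)
    # Any channel that disagrees with the majority is a candidate for isolation.
    faulty = sorted(ch for ch, v in results.items() if v != value)
    return value, faulty
```

For example, `vote_2oo3({"A": 10, "B": 10, "C": 11})` accepts the value 10 and identifies channel "C" as the one to be isolated.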
If at least two channels of a triple redundant system are operating properly, the overall system will continue to operate without error. At specific time intervals, each computer of the triple redundant system transmits its computational results to and receives data from the other two computers. The computers exchange information in a predetermined time window, during which all copies of the data must be received. Thus, each computer is loosely synchronized with the other computers in the redundant group. Each computer compares its results with the results of the other two by using a software algorithm that identifies and corrects any discrepancies. The results accepted as valid are then used for further system operations.
In addition to triple redundant flight control systems, other examples of prior art redundant computer systems include the Software Implemented Fault Tolerant (SIFT) and the Fault Tolerant Multiprocessor (FTMP) systems.
Like triple redundant systems, the SIFT system uses direct communication channels between the various computers in the system. However, SIFT is a step beyond basic triple redundancy in that the computers can be dynamically combined into redundant processing groups under software control. This allows graceful degradation because low priority tasks can be terminated to free computational resources to replace failed units. In addition, SIFT can dynamically control the level of redundancy (e.g., dual, triple, or quad) under which each system task is executed.
For a given SIFT configuration, each processor within a redundant group exchanges computational results with the other members of the group. The computers exchange information during a predetermined time window, during which time all copies of the data must be received. Each computer is loosely synchronized with the other members of the redundant group. After data has been exchanged, the results are compared using a software algorithm to identify discrepancies and to designate correct results for further use by the system.
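The windowed exchange and software comparison described above might be sketched as follows for a redundant group of arbitrary size. This is an illustrative assumption rather than the SIFT algorithm itself: the name `compare_group`, the timestamp representation, and the rule that a strict majority of the whole group is required are all hypothetical.

```python
from collections import Counter

def compare_group(arrivals, window_end, group):
    """Compare results exchanged within one time window.

    arrivals: dict mapping a group member to (arrival_time, value).
    window_end: latest arrival time still accepted for this window.
    group: list of all members of the redundant group.
    Returns (accepted_value, late_members, disagreeing_members).
    """
    # Keep only results from group members that arrived inside the window.
    on_time = {m: v for m, (t, v) in arrivals.items()
               if m in group and t <= window_end}
    late = sorted(m for m in group if m not in on_time)
    if not on_time:
        return None, late, []
    value, count = Counter(on_time.values()).most_common(1)[0]
    # Assumption: accept a value only if a strict majority of the group agrees.
    if count * 2 <= len(group):
        return None, late, sorted(on_time)
    disagreeing = sorted(m for m, v in on_time.items() if v != value)
    return value, late, disagreeing
```

With a quad redundant group in which one member misses the window, the remaining three agreeing results still form a majority; with a dual redundant group, a disagreement can be detected but not resolved.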
The FTMP system allows computers to be dynamically combined into redundant groups, although the redundancy level is fixed for all system tasks. Computational results are compared during each access to global memory. As a result, the computers within a redundant group are tightly synchronized so that each global memory reference is performed by all participating processors simultaneously.
The FTMP system differs from the foregoing in that the redundant data comparison and correction functions are performed automatically in hardware. This is made possible by the tight synchronization of the message channels. Hardware implementation of these functions results in approximately a 20 to 1 reduction in system overhead for FTMP compared to SIFT systems.
Message comparison in FTMP is performed on a serial bit basis. Error detection and correction are implemented by logic gates. FTMP is also unique in that the data buses forming the inter-processor communication network are multiplexed between the triple redundant clusters, rather than being point-to-point as in SIFT. This results in a significant reduction in hardware, permits a higher degree of fault tolerance, and provides better reliability through incorporation of spare data buses.
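Bit-serial comparison and correction by logic gates can be modeled in software with the standard three-input majority function. The sketch below is only a behavioral illustration under that assumption; it does not reproduce the actual FTMP circuitry, and the names `majority_bit` and `vote_serial` and the stream labels are hypothetical.

```python
def majority_bit(a, b, c):
    """Three-input majority gate: (a AND b) OR (a AND c) OR (b AND c)."""
    return (a & b) | (a & c) | (b & c)

def vote_serial(stream_a, stream_b, stream_c):
    """Vote three redundant bit streams one bit at a time.

    Returns (voted_bits, mismatch), where mismatch records whether each
    stream ever disagreed with the voted output.
    """
    voted = []
    mismatch = {"A": False, "B": False, "C": False}
    for a, b, c in zip(stream_a, stream_b, stream_c):
        m = majority_bit(a, b, c)
        voted.append(m)
        # A stream that differs from the majority is flagged as erroneous.
        mismatch["A"] |= (a != m)
        mismatch["B"] |= (b != m)
        mismatch["C"] |= (c != m)
    return voted, mismatch
```

For instance, if stream B carries a single flipped bit, the voted output matches streams A and C, and only B is flagged.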
In general, inter-processor communication in redundant computer systems requires significant overhead in both system hardware and computing resources. Hardware complexity is increased as the number of unique communication channels is increased. Computational resources are required to manage and control the exchange of redundant data and to perform consistency check and error correction operations. As a result, prior art redundant computer communication systems have the following basic limitations:
1. Point-to-point communication channels require N(N-1) communication channels to accommodate N redundant computers;
2. Redundant communication channels are dedicated to specific computers, so that a fault within a computer causes all of its dedicated channels to be unusable; and
3. A significant percentage of available computer throughput is consumed in managing and controlling the transmission, reception, comparison, and correction of redundant data.
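The growth in channel count stated in the first limitation can be made concrete with a short calculation. The sketch assumes one dedicated one-way channel from each computer to every other computer, which is the reading under which the N(N-1) figure holds; the function name is hypothetical.

```python
def point_to_point_channels(n):
    """Number of dedicated one-way channels among n redundant computers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each of the n computers needs a channel to each of the other n - 1.
    return n * (n - 1)

# Channel count for dual through sextuple redundancy.
channel_counts = {n: point_to_point_channels(n) for n in range(2, 7)}
```

Under this assumption the count rises from 2 channels for two computers to 6 for three, 12 for four, 20 for five, and 30 for six, so hardware complexity grows roughly with the square of the redundancy level.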
Thus, there is a need for a redundant inter-processor communication system that reduces required communication hardware, provides multiplexed communication channels, and performs data comparison and correction operations in hardware. A major goal is to reduce the system complexity and computational overhead required to manage and control redundant computer elements.