With rapid technological developments in areas such as aviation, space travel, robotics, medical devices, and electronic financial systems, there is an ever-growing need for computer systems that are reliable and resilient to failure. Replicated computers executing identical operations can provide fault tolerance: the outputs of the computers are compared to determine which computer, if any, generated an error during operation.
The fault tolerant parallel processor (FTPP) architecture provides replicated operation of commercial-off-the-shelf processing elements. This is accomplished by providing synchronization and data integrity services in a special-purpose communication device called a network element, which links replicated processors and other elements in fault containment regions to the rest of the FTPP system. Currently, the FTPP architecture uses one of two canonical forms of reaching agreement in the presence of faults: interactive convergence or interactive consistency.
Interactive convergence algorithms reach agreement on a correct value by averaging the locally perceived values. The locally perceived values may differ for each observer, but the algorithm converges, within a known error bound, to the same result across all properly functioning observers. The benefit of interactive convergence over interactive consistency is fewer rounds of communication.
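The averaging step described above can be illustrated with a minimal Python sketch. It assumes one common variant, sometimes called egocentric averaging, in which an observer replaces any reading that differs from its own value by more than a threshold with its own value before averaging. The function name `convergence_round`, the threshold `delta`, and all readings are hypothetical and are not taken from the FTPP architecture.

```python
from statistics import fmean

def convergence_round(own_value, perceived_values, delta):
    """One averaging round for a single observer (assumed variant).

    Readings farther than `delta` from the observer's own value are
    replaced by the own value, then all readings are averaged.
    """
    adjusted = [v if abs(v - own_value) <= delta else own_value
                for v in perceived_values]
    return fmean(adjusted)

# Hypothetical local values of three properly functioning observers.
good = {"A": 10.0, "B": 10.2, "C": 9.9}
# A faulty fourth node reports a different value to each observer.
faulty_reports = {"A": 50.0, "B": -30.0, "C": 10.1}

results = {}
for name, own in good.items():
    # Each observer perceives the three good values plus the value
    # the faulty node chose to send to it.
    perceived = list(good.values()) + [faulty_reports[name]]
    results[name] = convergence_round(own, perceived, delta=1.0)

spread_before = max(good.values()) - min(good.values())
spread_after = max(results.values()) - min(results.values())
print(results)
print("spread before:", spread_before, "after:", spread_after)
```

In this sketch the observers do not end up with identical values, but the spread among them shrinks after a single exchange, which is the sense in which the results agree only within an error bound.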
Interactive consistency algorithms guarantee that all properly functioning observers see the same values and can then perform a value selection from identical data sets. The cost of removing the averaging error, compared to an interactive convergence algorithm, is additional rounds of communication.
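The extra round of communication can be sketched in Python with the classic oral message exchange for a single fault: a source sends its value to three receivers, each receiver relays what it received to the other two, and each receiver then takes a majority. The four-node setup, the function names `majority` and `om1`, and the fault behaviors are illustrative assumptions, not the FTPP implementation.

```python
from collections import Counter

def majority(values, default=None):
    """Return the strict-majority value, or `default` if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def om1(source_sends, relay):
    """Two-round oral message exchange tolerating one fault.

    source_sends: receiver -> value the source sent it (round 1).
    relay: function (relayer, target, value) -> value that the
           relayer forwards to the target (round 2).
    Returns receiver -> value selected by that receiver.
    """
    receivers = list(source_sends)
    decided = {}
    for r in receivers:
        seen = [source_sends[r]]
        seen += [relay(other, r, source_sends[other])
                 for other in receivers if other != r]
        decided[r] = majority(seen)
    return decided

def honest_relay(relayer, target, value):
    return value

def relay_with_faulty_c(relayer, target, value):
    # Receiver C is assumed faulty and forwards a wrong value.
    return "z" if relayer == "C" else value

# Case 1: faulty source sends inconsistent values, receivers are honest.
faulty_source = om1({"B": "x", "C": "x", "D": "y"}, honest_relay)

# Case 2: honest source, receiver C relays incorrectly.
faulty_receiver = om1({"B": "x", "C": "x", "D": "x"}, relay_with_faulty_c)

print(faulty_source)
print(faulty_receiver)
```

In both cases the properly functioning receivers select the same value, at the price of a second round in which every receiver forwards its copy to the others.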
In a triplex system, which includes three network elements and fault containment regions, interactive convergence algorithms do not have the resources needed to operate. Oral message versions of interactive consistency algorithms can be replaced by signed message versions so that the triplex system can operate in Byzantine fault scenarios. However, once one of the fault containment regions fails, leaving only two functional fault containment regions, an FTPP system can only continue to operate as a duplex system when clock duplication algorithms have been applied. Such a duplex system is not fault tolerant. Often a system needs to start as a duplex system to conserve power, because battery power is limited in some applications, such as implanted medical devices. However, critical applications that require fault tolerance can be executed only when the system operates as a triplex system.
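The resource argument above can be made concrete with the node-count bounds commonly cited for Byzantine agreement: at least 3f + 1 nodes to tolerate f faults with oral messages, and at least f + 2 nodes with signed messages. The Python sketch below considers node count only and ignores other requirements such as connectivity and the number of communication rounds. The function names are hypothetical.

```python
def min_nodes_oral(f):
    """Commonly cited minimum node count for oral messages: 3f + 1."""
    return 3 * f + 1

def min_nodes_signed(f):
    """Commonly cited minimum node count for signed messages: f + 2."""
    return f + 2

def tolerates(nodes, f, signed):
    """True if `nodes` meets the node-count bound for `f` faults."""
    needed = min_nodes_signed(f) if signed else min_nodes_oral(f)
    return nodes >= needed

# One Byzantine fault in triplex (3 nodes) and duplex (2 nodes) systems.
for label, nodes in (("triplex", 3), ("duplex", 2)):
    print(label,
          "oral:", tolerates(nodes, 1, signed=False),
          "signed:", tolerates(nodes, 1, signed=True))
```

Under these bounds a triplex system can tolerate one fault only with signed messages, and a duplex system cannot tolerate a fault with either kind of message.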