Many present-day systems need to provide a reliable service to their clients. Examples include e-commerce systems that provide online shopping or reservation services, online banking systems, network control systems, database servers, and web servers. Such systems are often constructed to be fault-tolerant. That is, the system should be able to provide an uninterrupted service even if a component or subsystem becomes inoperative.
Individual components may fail for different reasons, which include hardware and software failures. If a failure occurs in a resource that is shared among multiple components, it might affect all of those components. For example a hardware failure in a power supply that services multiple components may bring all those components down. Some types of failures may only occur under special conditions, and under those conditions they may affect more than one component at the same time. Examples of these conditional failures include some software failures due to software bugs, for example the Y2K bug or the start or end of daylight saving time.
In order to achieve tolerance against single component failures, systems may utilize multiple redundant nodes (alternatively called agents, modules or hosts) to provide each service. Typically these independent nodes are built from independent components and are supplied by independent power sources. Each node alone is capable of performing all the tasks needed in the service, even if all other nodes fail. In other words, each node is a duplicate of any other node for performing those specific tasks. A failure in any one node does not affect the ability of any other node to provide the complete service, because the components of various nodes are independent of each other. The risk of a multiple failure that disables all nodes at once can be made as small as desired by providing a large number of independent nodes.
Making the various nodes independent of each other is complicated because the nodes must communicate with each other. Communication among the nodes is required for several reasons. First of all the nodes must agree among themselves which of their number is handling any particular client's request for service. Moreover, any node that changes the ongoing state of the service must inform other nodes of that change. For example, in an airline reservation system, if a client wishes to book an airline seat, one and only one node should be taking an available seat from the pool and reserve it for the client. Then the node that books the seat must inform other cooperating nodes that the seat is now taken. Finally, the communication is needed in case of a failure of an active node, so that one of the other nodes may take over the failed node's responsibilities.
For mission critical systems, it is important to equip the system with communication fabrics that are themselves robust and fault-tolerant. Such robustness is needed to achieve a consistent service to the client and to avoid catastrophic errors. One example of a catastrophic error, termed a split-brain syndrome, occurs when two or more duplicate nodes lose connection with each other, and each assumes, falsely, that the others are not functioning and thus decides to serve all requests by itself. Such a situation might result in serious errors, like double booking, double charging of a client, memory corruption, or database errors. To provide a robust communication fabric, the nodes are often connected to each other and to the outside world through multiple redundant paths, such that when one of the paths fails, another path can be utilized by the nodes to communicate with each other and with the outside world.
Even in the presence of robust communication fabrics, individual nodes in the system may fail and cease to perform their tasks. Thus the system must have a way to find out that there is a failure, and further to determine which node or communication path has failed and should be removed from the service, and which is still healthy and thus should survive to continue to offer service. If by mistake the system removes a healthy component from service instead of the one that has failed, then not only will the system have less of the healthy components than before the failure, but also the failed component will remain operational and capable of causing further damage.
One typical solution to the problem of fault tolerance is called triple modular redundancy (TMR), which uses a majority vote mechanism. In TMR at least three nodes are provided to perform any service. That way, if there is a failure in any one of the nodes, its behavior will differ from that of both of the other two healthy nodes, and those two constitute a majority of the three nodes. A comparison among outputs if all nodes is performed by at least three health monitors, each of which monitors and compares the behavior of all the nodes, using independent communication paths between all monitors and all nodes. This approach is more costly and complicated than an approach using only two nodes in the system.