Many software applications, notably in the field of telecommunications, require fault tolerance (referred to as "FT" in the rest of this specification) in order to ensure continuity of operation.
To provide fault tolerance, it is known to duplicate a process running on a platform, so as to obtain an active process and a standby process. If the active process goes down, the standby process is promoted to the active state, and all communications with the formerly active process are switched to the new active process.
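By way of illustration only, the active/standby scheme described above may be sketched as follows. The names `Process`, `fail_over` and `routes` are hypothetical and do not come from any particular platform; the sketch assumes that failure of the active process has already been detected by some other means.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    role: str           # "active" or "standby" (hypothetical labels)
    alive: bool = True


def fail_over(active: Process, standby: Process, routes: dict) -> Process:
    """Promote the standby if the active process is down.

    `routes` maps a client name to the name of the process serving it
    (an assumption made for this sketch). Returns the process that is
    active after the call.
    """
    if active.alive:
        return active                      # nothing to do
    standby.role = "active"                # promotion to the active state
    for client in list(routes):
        if routes[client] == active.name:
            routes[client] = standby.name  # switch communications
    return standby
```

In this sketch, a call such as `fail_over(p1, p2, routes)` after `p1.alive` has been set to `False` leaves `p2` in the active role and redirects every route that pointed at `p1`.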
It has also been suggested to provide a single standby process for several active processes, and similarly to promote the standby process to the active state when one of the active processes goes down.
Another known solution is to provide several active processes, without any standby process: when one of these active processes goes down, its functions and activity are switched to one or several other active processes.
In the prior art, all these solutions are provided on a per-application basis. In other words, each application running on a platform manages its own active and/or standby processes, independently of any other application that may also have fault tolerant capabilities. This is resource-consuming, and also involves providing fault tolerance separately in each application.
Accordingly, it is an object of the invention to reduce the need for resources in fault tolerant platforms, and also to reduce the need for providing fault tolerance in every application.
Another problem in such fault tolerant systems is the so-called split-brain syndrome. This problem occurs in fault tolerant systems if at least one process has a faulty view of reality. For instance, this may occur if there is a communication problem between nodes or processes; the communication problem may have any origin, such as a physical breakdown of the communication link, a bug in one or several processes, a time-out affecting one or several processes, or traffic congestion.
In this case, several processes may come into conflict for resources. In the example of a communication problem between a standby and an active process, the standby process may consider the active process dead and try to grab resources, while the active process will continue using the same resources. This usually leads to catastrophic or unstable situations.
A quorum solution to the split-brain syndrome has been proposed for systems with more than two nodes or processes in communication. Where a communication problem creates two groups of nodes or processes, thus leading to a possible split-brain syndrome, all the nodes or processes apply a decision rule according to which the group containing the majority of nodes is the one that should remain active.
This quorum method will not work for systems in which there is an even number of nodes, where the communication failure creates two groups comprising the same number of nodes or processes. The quorum method is also ineffective for multiple communication failures that may create several groups of nodes, none of which has the quorum majority. Finally, the quorum method will not work where only two processes are in communication.
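By way of illustration only, the majority rule and the cases in which it yields no surviving group may be sketched as follows. The function names `has_quorum` and `surviving_groups` are hypothetical, and the sketch assumes that every node knows the total number of nodes and the size of the group it can still reach.

```python
def has_quorum(group_size: int, total_nodes: int) -> bool:
    """A group keeps running only if it holds a strict majority."""
    return 2 * group_size > total_nodes


def surviving_groups(partition: list) -> list:
    """Return the groups of a partition that may remain active.

    `partition` is a list of groups, each group being a list of node
    names (a representation assumed for this sketch).
    """
    total = sum(len(group) for group in partition)
    return [group for group in partition if has_quorum(len(group), total)]


# Five nodes split 3/2: the group of three holds the majority.
print(surviving_groups([["a", "b", "c"], ["d", "e"]]))

# Four nodes split 2/2: no group holds a strict majority.
print(surviving_groups([["a", "b"], ["c", "d"]]))

# Five nodes split 2/2/1 after multiple failures: again no majority.
print(surviving_groups([["a", "b"], ["c", "d"], ["e"]]))

# Only two processes, e.g. one active and one standby: no majority.
print(surviving_groups([["active"], ["standby"]]))
```

Under these assumptions, only the first partition leaves a group entitled to remain active; in the three other cases the rule designates no group at all, which corresponds to the limitations set out above.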
Accordingly, it is a further object of the invention to provide a simple solution to the split-brain syndrome that will prove effective in the above configurations, and especially in the case of communication between an active and a standby process in a fault tolerant system.