Distributed computer systems have recently been implemented in the control systems of public switched telephone exchanges. The control of a telephone exchange is then distributed to several computers, which are connected with a relatively high-speed bus or equivalent transmission means. In telephone exchanges or other switching systems of this type, an effort is made to support system operations by replicating at least part of the distributed control computers. This should enable replication of the control computers in a way that does not unreasonably consume the performance capacity of the control computers, or require costly special equipment. Requirements for the replication of distributed computers in a telephone exchange environment are disclosed in the article "New fault-tolerance design: developments in software system architecture of the Nokia DX 200", Raimo Kantola, Discovery, Volume 22, First Quarter 1991, pp. 32-39.
Several solutions have previously been employed for replication in distributed computer systems. In a solution based on microsynchronization, two computers controlled by special equipment execute precisely the same computer instructions at precisely the same moment. The advantage of the microsynchronization method is its transparency to the application software. Its disadvantages are the cost of the special equipment and the difficulty, even impossibility, of applying the method effectively to computers backed up in the N+1 mode, particularly when N is at least several dozen. In the N+1 redundancy mode, N similar computers perform similar yet independent useful tasks using the same software. One further computer is a spare unit that is taken into use if one of said N computers fails or, for example, when the operational arrangements of the switching system so require. The advantage of the N+1 equipment redundancy method is its cost-effectiveness and its compatibility with the 2N method, with the difference that the connection of the spare unit to the active unit must always be carried out prior to unit changeover.
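The N+1 arrangement described above can be illustrated with a minimal sketch. The class and method names (`NPlusOnePool`, `attach_spare`, `changeover`) are hypothetical and chosen only for illustration; the sketch models nothing more than the rule that the single spare must be connected to an active unit before it can take that unit over.

```python
class NPlusOnePool:
    """N active units sharing one spare unit (illustrative model only)."""

    def __init__(self, n):
        self.roles = {f"unit-{i}": "active" for i in range(n)}
        self.attached_to = None   # active unit the spare is currently connected to
        self.spare_serves = None  # unit whose tasks the spare has taken over

    def attach_spare(self, unit):
        """Connect the spare to one active unit ahead of a changeover."""
        if self.roles.get(unit) != "active":
            raise ValueError(f"{unit} is not an active unit")
        self.attached_to = unit

    def changeover(self, unit):
        """Hand the tasks of `unit` to the spare; the spare must be attached first."""
        if self.attached_to != unit:
            raise RuntimeError("spare must be connected to the unit before changeover")
        self.roles[unit] = "out of service"
        self.attached_to = None
        self.spare_serves = unit


pool = NPlusOnePool(3)
try:
    pool.changeover("unit-1")      # rejected: the spare is not yet connected
    rejected = False
except RuntimeError:
    rejected = True

pool.attach_spare("unit-1")
pool.changeover("unit-1")
```

In this model a changeover without a prior connection is refused outright, which is one simple way to express the ordering constraint; a real system would also have to bring the spare up to date during the connection step.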
Solutions have also been implemented in which the redundant execution of the entire control is left to the application software, so that the state automaton of the program comprises the state transitions needed to keep the spare unit up to date with what the active computer executes. The drawback of this replication method is that the application must solve two problems at once: its actual task, and the replication supporting that task. This complicates the development of applications. Another drawback is that this replication method does not lead to a uniform implementation across applications, which makes the maintenance of the applications costly.
Methods that shield the replication from the applications have also been developed. These methods have aimed at correctness of computation in all participating computers, and they have therefore been heavy and have consumed the performance capacity of the computers. Such solutions give the correctness of execution priority over the availability of the system. They are therefore not particularly suitable for a switching environment, such as a telephone exchange, where high availability is more important than the absolute correctness of every discrete minor function.
A replication method is known from Finnish Patent application 912669 in which the processes performed by two or more computers operating in parallel are replicated in groups containing as many subprocesses as possible. The corresponding subprocesses within the corresponding groups of two units operating in parallel operate independently (asynchronously) of each other, yet there is no inconsistency between the processes performed by the parallel subprocesses. This method is based on multicast message handling between the processes: the information that is vital to the processing and is transmitted between the processes of the master unit is simultaneously delivered to the corresponding processes of the spare unit operating in parallel. Computers that operate in parallel thus execute the same program virtually simultaneously, so that externally the computers seem to transmit and receive the same messages in the same order. The method does not aim to guarantee the instantaneous correctness of two executions; it aims only to ensure that the operation executed by the computers operating in parallel is not inconsistent with the operation executed by the master computer of the group. This reduces the load that replication places on the computers and requires no special equipment other than a data transmission bus connecting the computers, which a distributed switching system requires in any case.
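The multicast message handling described above can be illustrated with a minimal sketch. It is not taken from the cited application; the names `Process`, `MulticastBus`, and the address `"call-control"` are assumptions made for illustration. Each message addressed to a process of the master unit is also queued for the corresponding process of the spare unit, and the two replicas consume their queues independently but in the same order.

```python
from collections import deque


class Process:
    """A replicated subprocess with its own inbox (illustrative model only)."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.handled = []  # messages handled so far, in arrival order

    def step(self):
        """Handle one queued message; return False if the inbox was empty."""
        if not self.inbox:
            return False
        self.handled.append(self.inbox.popleft())
        return True


class MulticastBus:
    """Delivers every message to all replicas registered under one address."""

    def __init__(self):
        self.groups = {}  # logical address -> list of replica processes

    def register(self, address, *replicas):
        self.groups[address] = list(replicas)

    def send(self, address, message):
        for replica in self.groups[address]:
            replica.inbox.append(message)


master = Process("master")
spare = Process("spare")
bus = MulticastBus()
bus.register("call-control", master, spare)

for msg in ["seize", "digit 5", "release"]:
    bus.send("call-control", msg)

# The replicas run asynchronously: here the master finishes before the spare.
while master.step():
    pass
spare.step()
spare_progress = list(spare.handled)  # the spare lags but is not inconsistent

while spare.step():
    pass
```

At the intermediate point the spare has handled only a prefix of what the master has handled, which matches the stated goal: the spare may lag behind, but it never handles messages in a different order from the master.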
In a prior art message-based replication method of this type, a hot-standby process must initially be created in parallel with the active process to be replicated, and the hot-standby process must be brought to the same dynamic state as the active process. On the level of computer units, this means taking the spare unit from the cold-standby state to the hot-standby state by first bringing the spare unit to a so-called initial steady state and then to a state which is consistent with the active unit. The initial steady state is achieved by loading the appropriate program codes and data files into the spare unit and by initiating the master processes. In this initial state, all stateless processes are already in the actual operative state. All state-oriented processes, by contrast, must further be brought from the initial steady state to a state which is consistent with the active unit. This procedure is termed the warm-up procedure of a process or a computer unit. The warm-up procedure may be passive or active. Passive warm-up refers to creating new computations as replicated computations, so that with time the number of equivalent computations in the spare unit comes closer and closer to the total number of parallel computations in the active unit. The passive warm-up procedure, however, gives no guarantee that the spare unit will ever reach a state consistent with the active unit, i.e. that the procedure will end successfully, nor does it provide any final criterion for the warm-up process. For this reason, and since passive warm-up may last too long, active warm-up is needed. Active warm-up refers to a procedure in which the current values of the state variables of the state-oriented processes of the active unit are copied to the corresponding state variables of the spare unit. The active warm-up procedure also provides a criterion for determining when the warm-up has terminated successfully.
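The difference between passive and active warm-up can be illustrated with a minimal sketch. The names `Unit`, `active_warm_up`, and `is_warm`, as well as the call identifiers and state fields, are hypothetical. The sketch also assumes that the state of the active unit does not change while it is being copied, which a real warm-up procedure could not take for granted.

```python
import copy


class Unit:
    """A computer unit holding the state variables of its computations."""

    def __init__(self):
        self.state = {}  # computation id -> dict of state variables


def active_warm_up(active, spare):
    """Copy current state variables to the spare; return how many were copied."""
    copied = 0
    for comp_id, variables in active.state.items():
        if spare.state.get(comp_id) != variables:
            spare.state[comp_id] = copy.deepcopy(variables)
            copied += 1
    return copied


def is_warm(active, spare):
    """Termination criterion: the spare holds the same state as the active unit."""
    return spare.state == active.state


active = Unit()
active.state = {"call-1": {"phase": "speech"}, "call-2": {"phase": "dialling"}}
spare = Unit()  # initial steady state: no state-oriented computations yet

# Passive warm-up: only computations created from now on exist in both units.
for unit in (active, spare):
    unit.state["call-3"] = {"phase": "seized"}
warm_after_passive = is_warm(active, spare)  # older calls are still missing

# Active warm-up: copy the remaining state variables explicitly.
copied = active_warm_up(active, spare)
warm_after_active = is_warm(active, spare)
```

With passive warm-up alone the spare becomes consistent only once every older computation in the active unit has ended, which may never happen; the explicit copy supplies a definite point at which `is_warm` holds.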
A number of demands are made on the active warm-up. Warm-up procedures should be applicable to all, or at least most, of the applications for cloning the dynamic state of the active computations, i.e. for transferring the computations to the spare unit without stopping the active computation controlled by external processes in the master unit. The warm-up procedures should be as transparent as possible to the applications. In addition, the warm-up procedures should cause as little disturbance as possible to the operation of the active unit, they must never cause errors in the computations of the active unit, and they should end once the spare unit has reached a consistent state.