The present invention generally relates to replication of executing programs from an application or modules of an operating system; more particularly, the present invention optimizes the process of recording events and transferring them from a primary machine on which programs are executed to a standby computer system.
It may be desirable to replicate entirely, on a second machine, an application or an operating system running on a primary machine. Replication may be needed for program debugging purposes. It may also be needed to balance workloads across systems for system management purposes. Replication may further be needed when the primary machine fails, in which case the second machine, the standby machine, is used in place of the primary machine. System management and Fault Tolerant (FT) systems that rely on replication have demanding performance requirements; in some of these cases, instantaneous replication may even be required.
Replication is achieved by recording and replaying events that produce non-deterministic results. Events producing deterministic results are not recorded, as they can be reproduced by simply re-executing the programs on the standby machine. Applications implementing communication protocols, or transactional applications such as server applications which communicate with the outside world, receive input information, which is a candidate for event logging for replication, and generate output information. Output events need not be logged; they are reproduced by re-executing the application on the standby machine. By contrast, when an internal or external input event occurs, the event is first logged locally and then transferred to the standby system. The transferred data may be used immediately in an active-active FT model, or used for a later replay in an active-passive FT model. The transfer of the logged event data from the primary system, where they are recorded, to the standby system, where they will be replayed, must be done safely and efficiently.
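By way of illustration only, the record and replay principle described above may be sketched as follows. The class names `Recorder` and `Replayer`, the function `application` and the use of a random number as the non-deterministic event are assumptions made for this sketch; they are not taken from any particular implementation.

```python
import random
from collections import deque


class Recorder:
    """Primary side (hypothetical): runs live and logs each non-deterministic result."""

    def __init__(self):
        self.log = []

    def nondeterministic(self, produce):
        value = produce()       # the real, non-deterministic event
        self.log.append(value)  # recorded so that it can be replayed later
        return value


class Replayer:
    """Standby side (hypothetical): returns logged results instead of live ones."""

    def __init__(self, log):
        self._pending = deque(log)

    def nondeterministic(self, produce):
        # The live source is ignored; the recorded value is returned in order.
        return self._pending.popleft()


def application(ctx):
    """Same program on both machines; only the event source differs."""
    total = 0
    for _ in range(5):
        event = ctx.nondeterministic(lambda: random.randint(0, 999))
        total = total * 31 + event  # deterministic step: never logged
    return total


recorder = Recorder()
primary_result = application(recorder)
standby_result = application(Replayer(recorder.log))
```

In this sketch only the five random values are logged. The arithmetic on `total` is deterministic and is reproduced on the standby side by re-execution alone.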
For replication in fault tolerant systems, it is always desirable, and indeed necessary, to improve the efficiency of the main stream recording of events in logs and of the transfer of these logs. Assuming that record and replay can be implemented locally on the primary machine, by storing the event log on the local storage system, the next step toward fault tolerance is to transfer all the necessary data (the recorded events) to the standby machine in real time. The most costly step in the process of replication for fault tolerant systems is this transfer of information between the primary and the standby systems: local logging costs on the order of a few nanoseconds on CPUs with 1 GB/s memory throughput, whereas an acknowledged transfer costs on the order of tens of microseconds over a Gigabit Ethernet link.
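To give a sense of scale, the short calculation below compares the two costs. The values of 5 ns and 50 µs are assumptions chosen only to match the orders of magnitude stated above; they are not measurements.

```python
# Assumed figures, chosen only to match the stated orders of magnitude.
LOCAL_LOG_NS = 5             # "a few nanoseconds" per locally logged event
ACKED_TRANSFER_NS = 50_000   # "tens of microseconds" per acknowledged transfer

# How many times more expensive an acknowledged transfer is than a local log.
cost_ratio = ACKED_TRANSFER_NS // LOCAL_LOG_NS

# Upper bound on events per second if every event waited for its own
# acknowledged transfer before the next one could proceed.
max_sync_events_per_second = 1_000_000_000 // ACKED_TRANSFER_NS

print(cost_ratio)                   # 10000
print(max_sync_events_per_second)   # 20000
```

Under these assumed figures, the transfer is four orders of magnitude more expensive than the local logging step.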
However, the transfer must also be done safely. This requirement is met if the standby system is able to recover from a failure of the primary system no matter when the failure occurs. In case of failure of the primary machine, all the data needed to replay the application up to the point where the failure occurred must be available, so that the application can then resume on the standby machine with no interruption visible to the external world. During the execution of the application on the primary machine, a failure may happen at any moment. In particular, a failure may affect the log transfer itself, causing the loss of critical replay data.
Some existing log transfer systems are fast but unsafe, because they do not ensure data integrity; such solutions are not acceptable for FT systems. Examples of such fast but unsafe protocols are UDP and Multicast IP, as used in multimedia stream broadcast systems.
Other standard solutions are safe but slow. One example is the TimesTen Database Transaction replication protocol over TCP/IP, by Oracle, when used in synchronous mode to ensure fault tolerance. To avoid the negative impact of losing the last recorded event, one possible solution, referred to in the rest of this document as the standard solution, is to transfer every event to the standby machine before it is processed by the application on the operational primary machine. The application on the primary machine then receives non-deterministic events or results only after they have been transmitted to the standby machine. This standard solution introduces a latency in the application corresponding to the logging and transfer of each event, followed by the transfer and reception of the acknowledgment. It also imposes many serialization points on the application execution, or requires a large amount of data to be transferred frequently. Each input is delayed before being provided to the application.
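The ordering imposed by the standard solution may be sketched as follows, purely by way of example. The classes `Standby` and `SynchronousPrimary` are hypothetical, and the standby machine is simulated in-process rather than reached over a network, so the sketch shows only the order of the steps and not their actual latency.

```python
class Standby:
    """Hypothetical in-process stand-in for the standby machine."""

    def __init__(self):
        self.log = []

    def receive(self, event):
        self.log.append(event)
        return True  # acknowledgment returned to the primary


class SynchronousPrimary:
    """Hypothetical primary that applies the standard solution to each input."""

    def __init__(self, standby, application):
        self.standby = standby
        self.application = application
        self.local_log = []
        self.round_trips = 0

    def on_input(self, event):
        self.local_log.append(event)         # 1. log the event locally
        acked = self.standby.receive(event)  # 2. transfer it and wait for the ack
        self.round_trips += 1
        if not acked:
            raise RuntimeError("standby did not acknowledge the event")
        return self.application(event)       # 3. only now deliver it to the application


processed = []
primary = SynchronousPrimary(Standby(), processed.append)
for event in ["req-1", "req-2", "req-3"]:
    primary.on_input(event)
```

Each input costs one full round trip before the application sees it, which is the serialization point referred to above: three inputs incur three acknowledged transfers.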