With record and replay of applications, the goal is to allow the simultaneous, identical execution of an application on, for instance, different machines. This implies not only that the execution must be reproduced identically, but also that it must occur at nearly the same time on a different host, despite the constraints of being remote (network latency, bandwidth), and with minimal performance degradation.
On the other hand, operating systems running on multi-processor machines able to operate in parallel must be adapted in order to allow record and replay of an application that executes non-deterministic events. Between those events, the application execution depends only on its initial state and program instructions and is, therefore, deterministic. In the case of a parallel architecture, such as a multi-processor computer or a network comprising a number of computers running in parallel, the use of shared resources accessible by a plurality of tasks adds a cause of non-determinism: the ordering of access to a shared resource by concurrent tasks.
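The ordering problem described above can be illustrated with a minimal sketch. The two tasks, the shared dictionary, and the helper `run` are hypothetical names introduced only for this illustration; each task is deterministic on its own, yet the final state depends on which task reaches the shared resource first.

```python
def task_a(shared):
    # Deterministic in isolation: adds 1 to the shared value.
    shared["x"] = shared["x"] + 1


def task_b(shared):
    # Deterministic in isolation: doubles the shared value.
    shared["x"] = shared["x"] * 2


def run(order, initial=1):
    """Apply the tasks to one shared resource in the given access order."""
    shared = {"x": initial}
    for task in order:
        task(shared)
    return shared["x"]


# Same tasks, same initial state, different access order.
a_then_b = run((task_a, task_b))
b_then_a = run((task_b, task_a))
```

With an initial value of 1, the two orders yield different final values, so the access order is exactly the information that must be captured for a faithful replay.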
In the simple case where a particular instruction or system call returns a non-predictable result, it is sufficient to instrument this operation in order to record its result during the original execution and, at replay, to simulate it and to force its result from the recorded value. Instructions and system calls which are deterministic on private, unshared memory become totally non-deterministic when operating on shared memory, because of the uncertainty of the initial state caused by the concurrent use of memory by other tasks, as described above. Rather than instrumenting each and every program instruction, the same applicant has proposed a method to ensure exclusive access to the shared memory by a single task during a scheduling period, thus restoring the deterministic property of an instruction block, as described in the international patent application ‘Method for optimizing the logging and replay of multi-task applications in a mono-processor or multi-processor computer system’ published under the number WO 2006/077260. As described in this patent application, during the recording session, one fifo queue per CPU is used for recording each task schedule period information and one fifo queue per shared resource is used for recording each exclusive access to that shared resource during task execution. During the replaying session, the logging data of the fifo queues transmitted to the replay machine are serialized to constitute the replay scheduling. The events are replayed according to the replay scheduling, each record from a CPU fifo generating a stop of the corresponding task execution.
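The simple case at the start of the paragraph above, recording a non-predictable result and forcing it at replay, can be sketched as follows. This is an illustration only and not the mechanism of WO 2006/077260: the `Recorder` and `Replayer` classes, the `call` wrapper, and the `application` function are assumptions made for the example, and `random.randint` merely stands in for any non-deterministic operation.

```python
import random
from collections import deque


class Recorder:
    """Runs the real operation and logs its result (original execution)."""

    def __init__(self):
        self.log = deque()

    def call(self, fn, *args):
        result = fn(*args)
        self.log.append(result)
        return result


class Replayer:
    """Does not run the operation; forces its result from the log."""

    def __init__(self, log):
        self.log = deque(log)  # private copy of the recorded results

    def call(self, fn, *args):
        return self.log.popleft()


def application(env):
    # Between the two instrumented calls the code is deterministic.
    a = env.call(random.randint, 0, 1_000_000)
    b = env.call(random.randint, 0, 1_000_000)
    return a - b


recorder = Recorder()
original = application(recorder)
replayer = Replayer(recorder.log)
replayed = application(replayer)
```

The replayed run returns the same value as the original run because every non-deterministic result is taken from the log in the order in which it was recorded.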
A record of a task scheduling period in one CPU fifo contains the information on the event having caused task interruption: the event can be a system call interrupt, a scheduler interrupt or a shared resource access interrupt. At replay, if the event from a CPU fifo is a scheduler interrupt (called UIC because it uses user instruction count), then an interrupt is programmed to force the task to stop at the correct instruction count before resuming the task. The interrupt will be either triggered by a performance monitoring counter register overflow (the PMC counting user instructions) or a software breakpoint. After the task resumes and suspends again, the task state is matched against the expected stop condition.
Three results are possible from the match. The first is an unexpected scheduler or breakpoint interrupt before the next stop condition: the task simply needs to be resumed. The second is an unexpected shared resource access or system call interrupt, for example because the replay session diverged earlier and is now entirely wrong. This is a replay error. The third is an expected stop condition: the replay can proceed and the next event can be de-queued from the log.
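The three outcomes of the match can be summarized in a short sketch. The type names `Kind`, `Stop` and `Outcome`, and the choice to test the expected stop condition by equality on event kind and instruction count, are assumptions made for this illustration; the prior-art application may define the stop condition differently.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SCHEDULER = "scheduler"          # UIC interrupt
    BREAKPOINT = "breakpoint"
    SYSCALL = "syscall"
    SHARED_ACCESS = "shared_access"


class Outcome(Enum):
    RESUME = "resume"                # first result: resume the task
    REPLAY_ERROR = "replay_error"    # second result: replay has diverged
    PROCEED = "proceed"              # third result: de-queue the next event


@dataclass(frozen=True)
class Stop:
    kind: Kind
    instruction_count: int


def match(actual: Stop, expected: Stop) -> Outcome:
    """Classify the state of a task that has just been suspended."""
    if actual == expected:
        return Outcome.PROCEED
    if actual.kind in (Kind.SCHEDULER, Kind.BREAKPOINT):
        # Spurious stop before the expected condition.
        return Outcome.RESUME
    # Unexpected system call or shared resource access.
    return Outcome.REPLAY_ERROR
```

For example, a breakpoint hit at instruction count 40 while a scheduler interrupt is expected at count 100 yields `Outcome.RESUME`, whereas a system call at the same point yields `Outcome.REPLAY_ERROR`.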
Thus, with the solution of the prior art patent application, all the interrupts of multi-task applications are logged and replayed accurately in such a multi-processor environment.
However, logging too many events is costly and degrades performance, especially with remote logging. Event logging is costly both in the amount of storage required and in transferring the information from the recording machine to the replaying machine when the latter is remote; the performance impact comes from the time to record and replay and the time to transfer event information.
Within this model, it is also impossible to formally ensure that there will be enough room in log fifos to store all the necessary events until the end of the scheduling period because one cannot predict how many system calls or exclusive accesses to shared resources will be performed before the release.
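The capacity problem can be made concrete with a minimal sketch of a fixed-size log. The `BoundedLog` class and the `record_period` helper are hypothetical and serve only to show that, whatever capacity is chosen in advance, a scheduling period that generates more events than expected cannot be stored in full.

```python
from collections import deque


class BoundedLog:
    """A fifo log whose capacity is fixed before recording starts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def append(self, event):
        # Refuse the event when the fifo is already full.
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True


def record_period(log, events):
    """Log one scheduling period; return how many events did not fit."""
    return sum(0 if log.append(event) else 1 for event in events)


# The number of events in a period is not known in advance.
fits = record_period(BoundedLog(4), range(3))
overflows = record_period(BoundedLog(4), range(10))
```

A period of three events fits in a log of capacity four, while a period of ten events leaves six events unrecorded, and the recorder cannot know beforehand which case it will face.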