The present invention is generally directed to methods for correcting synchronization faults in concurrently executed computer programs and, more particularly, to methods and systems for fault tolerance of concurrently executed software programs using controlled re-execution of the programs.
Concurrent programs are difficult to write. The programmer is presented with the task of balancing two competing forces: safety and liveness. Frequently, the programmer leans too much in one of the two directions, causing either safety failures (e.g., races) or liveness failures (e.g., deadlocks). Such failures arise from a particular kind of software fault (bug), known as a synchronization fault. Studies have shown that synchronization faults account for a sizeable fraction of observed software faults in concurrent programs. Locating synchronization faults and eliminating them by reprogramming is always the best strategy. However, many systems must maintain availability in spite of software failures. Concurrent programs include all parallel programming paradigms, such as multi-threaded programs, shared-memory parallel programs, message-passing distributed programs, and distributed shared-memory programs. A parallel entity may be referred to as a process, although in practice it may also be a thread.
Traditionally, it was believed that software failures are permanent in nature and, therefore, that they would recur in every execution of the program with the same inputs. This belief led to the use of design diversity to recover from software failures. In approaches based on design diversity, redundant modules with different designs are used, ensuring that there is no single point of failure. Contrary to this belief, it was observed that many software failures are, in fact, transient (they may not recur when the program is re-executed with the same inputs). In particular, the failures caused by synchronization faults are usually transient in nature.
The existence of transient software failures motivated a new approach to software fault tolerance based on rolling back the processes to a previous state and then restarting them (possibly with message reordering), in the hope that the transient failure will not recur in the new execution. Methods based on this approach have mostly relied on chance in order to recover from a transient software failure. In the special case of synchronization faults, however, it is desirable to do better.
It would therefore be desirable to be able to bypass a synchronization fault and recover from the resulting failure.
The present invention controls the re-execution of concurrent programs in order to avoid a recurrence of the synchronization failure. The invention provides a method of (i) tracing an execution, (ii) detecting a synchronization failure, (iii) determining a control strategy, and (iv) re-executing under control.
Control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.
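The idea of adding synchronizations during re-execution can be illustrated with a minimal sketch. It is not the claimed implementation: the function name run_controlled, the step labels, and the representation of a control as (before, after) pairs of labels are all assumptions made for illustration. Each process is a list of labeled steps, and each added synchronization forces one step to wait until another step has completed.

```python
import threading

def run_controlled(programs, control_edges):
    """Run each program (a list of (label, action) steps) in its own thread.

    control_edges is a hypothetical representation of the added
    synchronizations: each (before, after) pair makes step `after`
    wait until step `before` has completed.
    """
    # One event per step, set once that step has completed.
    done = {label: threading.Event()
            for steps in programs.values() for label, _ in steps}
    must_wait_for = {}
    for before, after in control_edges:
        must_wait_for.setdefault(after, []).append(before)
    trace, trace_lock = [], threading.Lock()

    def worker(steps):
        for label, action in steps:
            for before in must_wait_for.get(label, []):
                done[before].wait()      # the added synchronization
            action()
            with trace_lock:
                trace.append(label)      # record the step for tracing
            done[label].set()

    threads = [threading.Thread(target=worker, args=(steps,))
               for steps in programs.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace

# Two processes each perform an unsynchronized read-modify-write on x.
shared = {"x": 0}
seen = {}

def read(pid):
    return lambda: seen.__setitem__(pid, shared["x"])

def write(pid):
    return lambda: shared.__setitem__("x", seen[pid] + 1)

programs = {
    "p1": [("p1.read", read("p1")), ("p1.write", write("p1"))],
    "p2": [("p2.read", read("p2")), ("p2.write", write("p2"))],
}
# Added synchronization: p2 may not read until p1 has written.
trace = run_controlled(programs, [("p1.write", "p2.read")])
print(shared["x"], trace)
```

Without the control edge, p2 could read x before p1 writes it, and one increment would be lost. With the edge in place, both increments take effect in every run. The sketch assumes the control edges are consistent with program order; an inconsistent set would block forever.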
In accordance with the present invention, a method of providing fault tolerance in concurrently executing computer programs by controlling the re-execution of concurrent programs in order to avoid a recurrence of synchronization failures is provided, comprising:
(a) tracing the execution of concurrent programs;
(b) detecting synchronization failures resulting from said execution of the concurrent programs; and
(c) applying a control strategy, based on said detection of failures, for said execution of the concurrent programs.
Also in accordance with the present invention, application of a control strategy includes causing a re-execution of said concurrent programs under a control derived from tracing information during an execution, wherein said control includes using said information to add synchronizations to said concurrent programs during re-execution.
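The steps recited above (tracing, detecting, applying a control strategy, and re-executing) can be sketched end to end as a small simulation. Everything in it is an assumption made for illustration: the round-robin scheduler, the mutual-exclusion check used as the failure detector, and the strategy of serializing critical sections in their traced entry order are simple stand-ins, not the strategy of the invention. Each process is assumed to contain a single critical section, marked by "enter" and "exit" events.

```python
def execute(programs, edges=()):
    """Simulate a round-robin interleaving and return the trace.

    programs maps a process id to its list of (pid, kind) events.
    edges holds the added synchronizations as (before, after) event pairs.
    """
    pending = {pid: list(events) for pid, events in programs.items()}
    trace = []
    while any(pending.values()):
        progressed = False
        for pid in pending:
            if not pending[pid]:
                continue
            event = pending[pid][0]
            # An event is blocked until all of its predecessors have run.
            if all(b in trace for b, a in edges if a == event):
                trace.append(pending[pid].pop(0))
                progressed = True
        if not progressed:
            raise RuntimeError("control strategy blocks every process")
    return trace

def violates_mutual_exclusion(trace):
    """Detect a safety failure: two critical sections overlap."""
    inside = 0
    for _, kind in trace:
        if kind == "enter":
            if inside:
                return True
            inside += 1
        elif kind == "exit":
            inside -= 1
    return False

def derive_control(trace):
    """Hypothetical strategy: serialize critical sections in traced entry order."""
    order = [pid for pid, kind in trace if kind == "enter"]
    return [((a, "exit"), (b, "enter")) for a, b in zip(order, order[1:])]

def run_with_recovery(programs):
    trace = execute(programs)                 # (a) trace the execution
    if not violates_mutual_exclusion(trace):  # (b) detect a failure
        return trace
    edges = derive_control(trace)             # (c) determine a control
    return execute(programs, edges)           # re-execute under control

programs = {
    "p1": [("p1", "enter"), ("p1", "exit")],
    "p2": [("p2", "enter"), ("p2", "exit")],
}
first = execute(programs)
recovered = run_with_recovery(programs)
print(violates_mutual_exclusion(first), violates_mutual_exclusion(recovered))
```

In this simulation the uncontrolled run interleaves the two critical sections, while the re-execution adds one synchronization so that p2 enters only after p1 exits. A real system would also have to roll the processes back to a previous state before re-executing, which the sketch omits.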