Increasingly, users of software applications are demanding that the software be tolerant of faults. In particular, users are concerned with two components of fault tolerance: the availability and the data consistency of the application. For example, users of telecommunication switching systems demand that the switching systems be continuously available. For transmissions involving financial transactions, however, such as for bank teller machines, customers also demand the highest degree of data consistency.
Due to the complex and temporal nature of interleaving messages and computations in a distributed system executing a plurality of concurrent processes, however, no amount of verification, validation and testing during software debugging will detect and eliminate all software faults and give complete confidence in the availability and data consistency of the application. Accordingly, residual faults due to untested boundary conditions, unanticipated exceptions and unexpected execution environments have been observed to escape the testing and debugging process. When triggered during program execution, such faults cause the application process to crash or hang, thereby causing service interruption.
It is therefore desirable to have effective on-line retry mechanisms for automatically detecting and bypassing such software faults, in order to allow recovery from the resulting software failures. Several studies have shown that many software failures in production systems behave in a transient fashion. Accordingly, the easiest way to recover from such failures is to restart the application process and thereby execute the same process under different conditions, an approach often referred to as environment diversity. However, restarting a system often involves a comprehensive initialization procedure and thus may cause considerable service disruption.
Thus, in order to minimize the amount of time lost in restarting a system, numerous checkpointing and rollback recovery techniques have been proposed to recover more efficiently from transient hardware failures. For a general discussion of checkpointing and rollback recovery techniques, see R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 23-31 (Jan. 1987). Generally, a checkpoint is a periodic backup copy of the data associated with an application process. The checkpoint is stored in a backup memory device, and allows the application process to be restarted from the checkpointed state rather than from initialization.
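By way of illustration only, the checkpoint and rollback operations described above can be sketched as follows. This is a minimal single-process sketch, not the technique of any cited reference: the class name `CheckpointedProcess`, its methods, and the use of an in-memory copy in place of a backup memory device are all assumptions made for the example.

```python
import copy


class CheckpointedProcess:
    """Toy application process whose state can be saved and restored.

    Hypothetical names throughout; an in-memory deep copy stands in
    for the backup memory device that would hold a real checkpoint.
    """

    def __init__(self):
        self.state = {"count": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def take_checkpoint(self):
        # Save a backup copy of the data associated with the process.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard work done since the last checkpoint and restart from it.
        self.state = copy.deepcopy(self._checkpoint)

    def handle(self, increment):
        # Ordinary processing that modifies the process state.
        self.state["count"] += increment


proc = CheckpointedProcess()
proc.handle(5)
proc.take_checkpoint()   # checkpointed state: count == 5
proc.handle(7)           # work performed after the checkpoint
proc.rollback()          # simulated failure: restart from the checkpoint
restored_count = proc.state["count"]
```

In this sketch only the work performed after the most recent checkpoint is lost on rollback, which is the saving that checkpointing offers over a full restart.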
Few, if any, checkpointing and rollback recovery techniques, however, have been proposed to recover from transient software failures. It is submitted that the rollback techniques previously developed for transient hardware failure recovery can also be used to recover from software errors by exploiting message replay and message reordering to bypass software faults.
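The idea of bypassing a software fault by message replay and message reordering can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a definitive implementation: the function names `run` and `recover`, the exception `TransientFault`, and the order-dependent bug itself are all hypothetical. The fault is made deterministic in the message order so that the example is reproducible; in a real system, a replay in the original order may already succeed because the execution environment has changed.

```python
from itertools import permutations


class TransientFault(Exception):
    """Stands in for a crash caused by an untested boundary condition."""


def run(messages):
    """Process (sender, value) messages and return the sum of the values.

    Hypothetical bug: the process fails whenever a message from 'p2'
    is handled immediately after a message from 'p1'.
    """
    total = 0
    previous_sender = None
    for sender, value in messages:
        if previous_sender == "p1" and sender == "p2":
            raise TransientFault("p2 handled right after p1")
        total += value
        previous_sender = sender
    return total


def recover(logged_messages):
    """Retry from the logged messages, first replaying, then reordering.

    The first permutation produced is the original logged order, so the
    first attempt is a plain replay. Each sender contributes a single
    message here, so every permutation is a legitimate delivery order.
    """
    attempts = 0
    for order in permutations(logged_messages):
        attempts += 1
        try:
            return run(order), attempts
        except TransientFault:
            continue  # this ordering triggers the fault; try the next one
    raise TransientFault("every ordering triggered the fault")


log = [("p1", 1), ("p2", 2), ("p3", 3)]
result, attempts = recover(log)
```

Here the replay in logged order fails, and the first reordering, in which the message from `p3` is delivered before the message from `p2`, completes with the same final result.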
As is apparent from the above discussion, a need exists for a progressive retry system for bypassing transient software faults which minimizes the scope of the rollback, including the number of processes involved in the rollback and the total rollback distance, in order to achieve faster recovery. A further need exists for a recovery system that progressively increases the rollback distance and the number of affected processes when a previous retry attempt fails, in order to gradually increase the degree of nondeterminism with each retry step.