(1) Field of Invention
This invention relates to fault-tolerant systems and methods. More particularly, the invention relates to fault-tolerant systems and methods using optimistic logging with asynchronous recovery in message passing systems.
(2) Description of Prior Art
Log-based rollback-recovery is an effective technique for providing low-cost fault tolerance to distributed applications. See Appendix I, [1, 3, 12, 7, 4]. It is based on the following piecewise deterministic (PWD) execution model [12]: process execution is divided into a sequence of state intervals, each of which is started by a non-deterministic event such as message receipt. For simplicity, it is assumed that message-delivering events are the only source of non-determinism in this invention. The execution within an interval is completely deterministic. During normal execution, each process periodically saves its state on stable storage as a checkpoint. The contents and processing orders of the received messages are also saved on stable storage as message logs. Upon a failure, the failed process restores a checkpointed state and replays logged messages in their original order to deterministically reconstruct its pre-failure states. Log-based rollback-recovery is especially useful for distributed applications that frequently interact with the outside world [4]. It can be used either to reduce the amount of lost work due to failures in long-running scientific applications [4], or to enable fast and localized recovery in continuously-running service-providing applications [5].
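By way of illustration only, the checkpoint-and-replay cycle described above may be sketched as follows. The names StableStorage, Process, take_checkpoint and recover, and the arithmetic state transition, are hypothetical and are not taken from the prior art references; the sketch assumes a single process whose only non-deterministic events are message deliveries and whose every received message reaches the log before a failure.

```python
class StableStorage:
    """Hypothetical stand-in for stable storage: its contents survive a failure."""

    def __init__(self):
        self.checkpoint = None   # (state, number of messages delivered)
        self.message_log = []    # received messages in processing order


class Process:
    """A process whose execution between message deliveries is deterministic."""

    def __init__(self, storage):
        self.storage = storage
        self.state = 0
        self.delivered = 0

    def deliver(self, msg):
        # Deterministic transition: the same messages in the same order
        # always produce the same state (the PWD assumption).
        self.state = self.state * 31 + msg
        self.delivered += 1

    def receive(self, msg):
        # Record content and order of the message, then start a new state interval.
        self.storage.message_log.append(msg)
        self.deliver(msg)

    def take_checkpoint(self):
        # Save the current state together with the log position it corresponds to.
        self.storage.checkpoint = (self.state, self.delivered)


def recover(storage):
    """Restore the last checkpoint and replay the logged messages that follow it."""
    process = Process(storage)
    if storage.checkpoint is not None:
        process.state, process.delivered = storage.checkpoint
    # Replay in original order; replayed messages are not logged a second time.
    for msg in storage.message_log[process.delivered:]:
        process.deliver(msg)
    return process
```

In this sketch, a process that receives three messages, takes a checkpoint, and receives two more can be rebuilt by recover from the same storage object, and the rebuilt process reaches the same state as the one that failed. A message that never reached the log could not be replayed, which is the situation optimistic logging has to handle.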
Depending on when received messages are logged, log-based rollback-recovery techniques can be divided into two categories: pessimistic logging [1, 5] and optimistic logging [12]. Pessimistic logging either synchronously logs each message upon receiving it, or logs all delivered messages before sending a message. It guarantees that any process state from which a message is sent is always recreatable, and therefore no process failure will ever revoke any message to force its receiver to also roll back. This advantage of localized recovery comes at the expense of a higher failure-free overhead. In contrast, optimistic logging first saves messages in a volatile buffer and later writes several messages to stable storage in a single operation. It incurs a lower failure-free overhead due to the reduced number of stable storage operations and the asynchronous logging. The main disadvantage is that messages saved in the volatile buffer may be lost upon a failure, and the corresponding lost states may revoke messages and force other non-failed processes to roll back as well.
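The difference in failure-free overhead and in exposure to loss may be illustrated by the following sketch. The classes PessimisticLogger and OptimisticLogger and the batch_size parameter are hypothetical names introduced for illustration. As a simplifying assumption, the optimistic logger flushes its volatile buffer whenever a fixed number of messages has accumulated, whereas an actual implementation would typically write the buffer asynchronously.

```python
class PessimisticLogger:
    """Writes every message to stable storage synchronously on receipt."""

    def __init__(self):
        self.stable = []
        self.stable_writes = 0

    def log(self, msg):
        self.stable.append(msg)
        self.stable_writes += 1   # one stable-storage operation per message

    def crash(self):
        # Nothing is held in volatile memory, so no logged message is lost.
        return []


class OptimisticLogger:
    """Buffers messages in volatile memory and writes them in batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size   # assumed flush policy, for illustration only
        self.volatile = []
        self.stable = []
        self.stable_writes = 0

    def log(self, msg):
        self.volatile.append(msg)
        if len(self.volatile) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.volatile:
            self.stable.extend(self.volatile)
            self.volatile.clear()
            self.stable_writes += 1   # one stable-storage operation per batch

    def crash(self):
        # The volatile buffer does not survive the failure.
        lost, self.volatile = self.volatile, []
        return lost
```

For ten messages and a batch size of four, the pessimistic logger in this sketch performs ten stable-storage operations and loses nothing on a failure, while the optimistic logger performs two operations and loses the two messages still held in its volatile buffer. The states created by those two messages are the lost states that may force other processes to roll back.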
Although pessimistic logging and optimistic logging provide a tradeoff between failure-free overhead and recovery efficiency, it has traditionally been only a coarse-grained tradeoff: the application has to either tolerate the high overhead of pessimistic logging, or accept the inefficient recovery of optimistic logging. In practice, it is desirable to have a flexible scheme with tunable parameters so that each application can fine-tune the above tradeoff based on the load and failure rate of the system. For example, a telecommunications system needs to choose a parameter that controls the overhead so that the system remains responsive during normal operation, and that also controls the rollback scope so that the system can recover reasonably fast upon a failure.