As the demand for greater computing power and for greater availability of computer processing to users has increased at a tremendous rate in recent years, system designers have looked beyond the uniprocessor-based system to systems which include a collection of coupled processors. Such multiprocessor systems, in the form of distributed systems, are typically configured so that each processor can perform processing operations which can be communicated to other processors in the system or to external devices as appropriate.
In the various types of currently proposed multiprocessor system architecture, a major concern relates to the possible failure of one or more of the processors in the system and how the system may recover correctly and efficiently from such failure. In this regard, system recovery after failure--which determines the reliability of the system--is particularly difficult in the distributed system because some processors can fail while others continue to run.
Moreover, recovery is further complicated in that the processors interact, the operations performed by one processor depending on operations performed by other processors in the system. When recovering from a failure in one processor, the failed processor in various prior systems is rolled back to an earlier state saved at a checkpoint. Other processors must then be rolled back to earlier checkpoints as the system attempts to return to a consistent system-wide state. The rollbacks may be unbounded, leading to a problem referred to as "cascading rollbacks" or the "domino effect". That is, in seeking a consistent state between processors in the system, one processor after another is driven to progressively earlier checkpoints with no predefined stopping point. Several approaches to system recovery which avoid the domino effect have been suggested.
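The domino effect described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems; it assumes a simplified model in which every event carries a timestamp on a single illustrative clock, each process always has an initial checkpoint at time 0, and only "orphan" messages (received in a state that is kept, but sent in a state that was rolled back) force further rollback. The names `recovery_line`, `CHECKPOINTS` and `MESSAGES` are hypothetical.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Return, per process, the time to which it must be rolled back.

    checkpoints: dict mapping process name -> list of checkpoint times (must include 0)
    messages:    list of (sender, sent_at, receiver, received_at) tuples
    failed:      name of the process that failed
    math.inf means "no rollback needed" for that process.
    """
    line = {p: math.inf for p in checkpoints}
    # The failed process restarts from its most recent checkpoint.
    line[failed] = max(checkpoints[failed])
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # Orphan message: the send was undone but the receipt survives,
            # so the receiver must also be rolled back to before the receipt.
            if sent_at > line[sender] and received_at < line[receiver]:
                line[receiver] = max(
                    c for c in checkpoints[receiver] if c < received_at
                )
                changed = True
    return line

# Illustrative data (assumed): checkpoints and messages are interleaved so
# that every checkpoint is separated from the next by a crossing message.
CHECKPOINTS = {"P": [0, 6, 12], "Q": [0, 3, 9]}
MESSAGES = [
    ("P", 1, "Q", 2),
    ("Q", 4, "P", 5),
    ("P", 7, "Q", 8),
    ("Q", 10, "P", 11),
    ("P", 13, "Q", 14),
]

if __name__ == "__main__":
    # P fails after sending its last message; both processes end up at time 0.
    print(recovery_line(CHECKPOINTS, MESSAGES, "P"))  # {'P': 0, 'Q': 0}
```

In this example a single failure of P drives both processes back to their initial states, even though each had taken two later checkpoints. If the last message is removed, P simply restarts from its checkpoint at time 12 and Q is not rolled back at all.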
One approach to achieving fault tolerance in a distributed multiprocessor system without the domino effect is based on a transaction model. According to this approach, computation is divided into units of work called "transactions" which satisfy several predefined assumptions. Because computation in many distributed systems cannot be structured as transactions, this approach is limited in application. In addition, the transaction approach appears more tailored to implementing a single logical process by executing non-interfering parts in parallel. For applications whose logical structure includes multiple logical processes, the transaction approach is relatively expensive. The following articles consider transaction models in which recovery is built into an operating system: "Recovery Semantics for a DB/DC System", Proceedings of the ACM National Conference, 1973, by C. T. Davies; "Recovery Scenario for a DB/DC System", Proceedings of the ACM National Conference, 1973, by L. Bjork; "The Recovery Manager of the System R Database Manager", Computing Surveys, volume 13, number 2, 1981, by J. Gray et al.; and "Guardians and Actions: Linguistic Support for Robust Distributed Programs", 9th Annual Symposium on Principles of Programming Languages, NM, 1982, by B. Liskov and R. Scheifler. An article by C. Mohan and B. Lindsay ("Efficient Commit Protocols for the Tree of Processes Model Of Distributed Transactions", Proceedings of the 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing, 1983) relates to the synchronous logging and checkpointing of records into stable storage.
Another proposed approach to avoiding the domino effect is to synchronize checkpointing and communication. In an embodiment by Borg et al.--in "A Message System Supporting Fault Tolerance", 9th ACM Symposium on Operating System Principles, October, 1983--implementing this approach, the system is described in terms of processing units and processes. A "processing unit" is defined as a conventional computer. Processing units communicate by means of messages over some medium. Each processing unit runs its own copy of the operating system kernel to perform processes. A "process" is defined as an execution of a program that is controlled by the operating system kernel. In accordance with the Borg et al. embodiment, there is a primary process and a backup process each of which resides in a different processing unit. The primary process and the backup process contain identical code. At the start of computation and periodically thereafter, the state of the primary process is checkpointed by being copied to the backup process. Additionally, each input message received by the primary process is also provided to the backup process. If the primary process fails, the backup process executes messages stored since the latest checkpoint. This embodiment requires four-way synchronization upon each communication: the primary process and backup process of the sending processing unit, and the primary process and the backup process of the receiving processing unit. The Borg et al. embodiment cannot tolerate arbitrary multiple failures. For example, if the processing unit of the primary process and the processing unit of the backup process fail, recovery is impossible.
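The primary/backup arrangement described above can be sketched as follows. This is only an illustration of the checkpoint-and-replay idea, not the Borg et al. implementation: it assumes integer process state, a deterministic message handler shared by both processes, and in-memory objects standing in for separate processing units. The names `apply_message`, `Backup` and `Primary` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

def apply_message(state: int, msg: int) -> int:
    """Deterministic handler run by both primary and backup (identical code)."""
    return state + msg

@dataclass
class Backup:
    """Holds the latest checkpoint plus the messages received since then."""
    checkpoint: int = 0
    logged: List[int] = field(default_factory=list)

    def take_checkpoint(self, state: int) -> None:
        # A new checkpoint makes the previously logged messages redundant.
        self.checkpoint = state
        self.logged.clear()

    def log(self, msg: int) -> None:
        self.logged.append(msg)

    def recover(self) -> int:
        # Re-execute the messages stored since the latest checkpoint.
        state = self.checkpoint
        for msg in self.logged:
            state = apply_message(state, msg)
        return state

@dataclass
class Primary:
    backup: Backup
    state: int = 0

    def __post_init__(self) -> None:
        # Checkpoint at the start of computation.
        self.backup.take_checkpoint(self.state)

    def receive(self, msg: int) -> None:
        # Each input message is also provided to the backup.
        self.backup.log(msg)
        self.state = apply_message(self.state, msg)

    def checkpoint(self) -> None:
        # Periodic checkpoint: copy the primary state to the backup.
        self.backup.take_checkpoint(self.state)

if __name__ == "__main__":
    backup = Backup()
    primary = Primary(backup)
    primary.receive(1)
    primary.receive(2)
    primary.checkpoint()
    primary.receive(3)
    # If the primary now fails, the backup reconstructs the same state.
    print(backup.recover() == primary.state)  # True
```

The sketch also makes the limitation noted above visible: the recoverable state exists only in the backup, so if the backup is lost together with the primary there is nothing left from which to recover.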
Another embodiment of the synchronized recovery approach, described by J. R. Bartlett in "A 'Non-stop' Operating System", 11th Hawaii International Conference on System Sciences, 1978, requires three-way synchronization and likewise does not tolerate arbitrary multiple failures.
In the synchronized recovery approach, the state of each process is checkpointed upon each message communication, so that rollback is never necessary. However, when one processing unit sends a message to another processing unit, neither can continue processing until both have logged a checkpoint. The synchronized recovery approach thus avoids the domino effect, but pays throughput and response-time penalties due to the required synchronization.
Hence, while the problem of reliable and effective recovery from failure in a distributed system has been considered, a general method and apparatus for recovering from multiple failures has not been taught--especially where message communication, processor operations (such as computation and message generation), checkpointing, and committing of output to external devices can proceed asynchronously.