This invention relates to improved means and methods for providing fault tolerance in a data processing system.
As computer systems increase in speed, power and complexity, it has become increasingly important to provide fault tolerance in such systems to prevent the system from "going down" in the event of hardware and/or software failure. However, providing fault tolerant capabilities in a computer system has proven to be expensive and to introduce significant performance penalties.
A basic way of achieving fault tolerance in a data processing system is to provide each task (also called a process) with a backup task such that, if the primary task fails, the backup task is automatically able to recover and continue execution. For example, a primary task and its backup task could be provided using a pair of simultaneously executing CPUs (central processing units) intercoupled such that, if one fails, execution continues on the other. It will be appreciated that providing such duplicate hardware is a very expensive way of achieving fault tolerance, particularly since the simultaneously operating duplicate hardware cannot be used to provide additional data processing power.
One known approach for avoiding hardware duplication is to provide a first CPU for the primary task, and a second CPU for the backup task, the backup becoming active to recover and continue execution only if the primary fails. Until then, the backup CPU can do other processing. In order to assure that the backup process can take over in the event the primary process fails, this known approach provides for a checkpointing operation to occur whenever the primary data space changes. This checkpointing operation copies the primary's state and data space to that of the backup so that the backup task will be able to continue execution if the primary task fails. However, the frequent checkpointing required by this approach detrimentally affects performance and also uses up a significant portion of the added computing power.
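The cost of checkpointing on every data space change can be illustrated with a minimal sketch. The class names `PrimaryTask` and `BackupTask`, the dictionary used as a data space, and the full-copy strategy are all assumptions made for illustration; they are not taken from any particular prior system.

```python
import copy


class BackupTask:
    """Hypothetical backup: holds the last checkpointed copy of the primary's state."""

    def __init__(self):
        self.state = {}
        self.checkpoints_received = 0

    def receive_checkpoint(self, snapshot):
        self.state = snapshot
        self.checkpoints_received += 1

    def take_over(self):
        # On primary failure, execution would resume from the last checkpoint.
        return copy.deepcopy(self.state)


class PrimaryTask:
    """Hypothetical primary: checkpoints after every change to its data space."""

    def __init__(self, backup):
        self.data = {}
        self.backup = backup

    def write(self, key, value):
        self.data[key] = value
        # Assumed strategy: a full copy on each change, which is the
        # overhead the passage above attributes to this approach.
        self.backup.receive_checkpoint(copy.deepcopy(self.data))


backup = BackupTask()
primary = PrimaryTask(backup)
for i in range(1000):
    primary.write("counter", i)

recovered = backup.take_over()
print(recovered, backup.checkpoints_received)
```

In this sketch, one thousand writes cause one thousand checkpoint transfers, although only the final state is needed for recovery. This illustrates why reducing checkpoint frequency is desirable.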
Another known approach is disclosed in U.S. Pat. No. 4,590,554. Although this approach also uses checkpointing, it provides the advantage of employing a fault tolerant architecture which significantly reduces the frequency of checkpointing. However, the approach has the disadvantage of requiring a message transmission protocol which is essentially synchronous, in that messages must be transmitted to primary and backup processors substantially simultaneously. The approach disclosed in the aforementioned patent has the additional disadvantage of requiring atomic transmission, wherein transmittal of a message by a task is not allowed unless the receiving tasks and all backups indicate that they are able to receive the message. Furthermore, no receiving task is allowed to proceed until all receiving tasks and backups have acknowledged receipt of the message. These message transmission protocol requirements introduce constraints that add complexity to the system and have a significant detrimental effect on performance.
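The atomic transmission constraint described above can be sketched as a simple all-or-nothing delivery routine. This is an illustrative model only, not the protocol of the cited patent: the `Endpoint` class, the `atomic_send` function, the three-phase structure, and the particular set of endpoints are hypothetical.

```python
class Endpoint:
    """Hypothetical receiver: a receiving task or one of the backups."""

    def __init__(self, name, ready=True):
        self.name = name
        self.ready = ready
        self.inbox = []
        self.may_proceed = False

    def deliver(self, message):
        self.inbox.append(message)
        return True  # acknowledgement of receipt


def atomic_send(message, endpoints):
    """Deliver to every endpoint or to none; release receivers only after all acks."""
    # Phase 1: transmission is refused unless every endpoint is able to receive.
    if not all(e.ready for e in endpoints):
        return False
    # Phase 2: deliver the message and collect acknowledgements.
    acks = [e.deliver(message) for e in endpoints]
    # Phase 3: no endpoint proceeds until all have acknowledged.
    if all(acks):
        for e in endpoints:
            e.may_proceed = True
    return all(acks)


receiver = Endpoint("receiver")
receiver_backup = Endpoint("receiver-backup", ready=False)
sender_backup = Endpoint("sender-backup")
group = [receiver, receiver_backup, sender_backup]

first_attempt = atomic_send("m1", group)   # refused: one backup is not ready
receiver_backup.ready = True
second_attempt = atomic_send("m1", group)  # delivered to every endpoint
```

In this model a single unready backup stalls the sender and every other receiver, which illustrates the kind of coupling that makes such a protocol costly.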
Approaches similar to that disclosed in the aforementioned U.S. Pat. No. 4,590,554 are described in an article by A. Borg, et al., "A Message System Supporting Fault Tolerance," Ninth Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983), pages 90-99, ACM, New York, 1983, and in an article by A. Borg, et al., "Fault Tolerance Under UNIX," ACM Transactions on Computer Systems, Vol. 7, No. 1, February 1989, pages 1-24.