The present invention relates to fault-tolerant computer systems. In particular, it relates to reducing the memory required to operate a fault-tolerant system.
Processes run on computers are used to obtain many useful results. For instance, computer processes can be used for word processing, for performing calculations, for banking purposes, and for routing messages in a network. A problem with computer processes is that sometimes a process will fail. Although for some programs failure may have minimal negative consequences, in other cases, such as a banking application, the negative consequences can be catastrophic.
It is known to use fault tolerant processes to enable recovery from failure of a process. In particular it is known to use traditional process pairs where one process is the working process doing work and the other process is essentially a clone of the working process that takes over if the working process fails. See e.g. “Transaction monitor process with pre-arranged modules for a multiprocessor system”, U.S. Pat. No. 5,576,945, issued Nov. 19, 1996. The working process at intervals sends information about its state (“checkpoints”) to the backup process. (In process pairs, a checkpoint is sent at a minimum when an external state relating to the process is changed, such as when a file is opened or when a banking program does a funds transfer. Usually checkpoints are sent much more frequently, however.) Upon failure, the backup process begins execution from the last checkpointed state.
A problem with using traditional process pairs is that, because a redundant process is set up, about double the memory needed to run a single process is required. The clone creates a copy of the contents of the memory image of the working process, including the state of the working space memory such as the stack. A copy of the program (“code segment”) is also maintained in memory. The code segment typically is an object file that is read from disk, loaded into memory at run time, and executed. The code segment is typically a relatively large portion of the memory image copy.
Memory is expensive and also takes up space. Accordingly, it would be advantageous to have a way to run fault-tolerant processes using less memory. It would further be advantageous for the time needed to take over for a failed process to be short.
Systems and methods for implementing a memory-efficient fault tolerant computing system are provided by virtue of one embodiment of the present invention. A generic backup process may provide fault tolerance to multiple working processes. The backup process need not include a copy of the code segments executed by the working processes, providing very large savings in memory needed to implement the fault tolerant system. Alternatively, multiple backup processes provide fault tolerance but need not include duplicated code segments for the working processes they support.
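The memory saving described above can be illustrated with a rough estimate. The following is only a sketch under stated assumptions: the function names and the sizes in megabytes are hypothetical and do not come from the specification, and the overhead of the backup process itself is ignored.

```python
def traditional_pairs_memory(code_sizes, state_sizes):
    """Each working process has a clone that holds both code and state."""
    return sum(2 * (code + state) for code, state in zip(code_sizes, state_sizes))


def generic_backup_memory(code_sizes, state_sizes):
    """Workers hold code and state; the backup holds only a copy of each state."""
    return sum(code + 2 * state for code, state in zip(code_sizes, state_sizes))


# Hypothetical sizes (in MB) for three working processes.
code_mb = [40, 40, 60]
state_mb = [5, 8, 4]

traditional = traditional_pairs_memory(code_mb, state_mb)
generic = generic_backup_memory(code_mb, state_mb)
saved = traditional - generic  # equals the total size of the code segments
```

Under these assumptions the saving equals the combined size of the code segments that no longer need to be duplicated, which is consistent with the observation that the code segment is typically a large portion of the memory image.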
In one embodiment, backup processes maintain state information about each working process including the contents of stack memory and heap memory. Checkpoint messages from a working process to a backup process keep the state information updated to facilitate takeover on failure. At takeover on failure, a backup loads a code segment associated with the working process and resumes using the current backup state information. With recent advances in processor speed, loading of the code segment occurs very quickly.
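One way a generic backup might track state for several working processes, without holding their code segments, is sketched below. This is an illustrative assumption rather than the implementation of the invention: the class name `GenericBackup`, its methods, and the choice to send a full snapshot in every checkpoint are all hypothetical.

```python
import copy


class GenericBackup:
    """Holds checkpointed state for many workers, but none of their code."""

    def __init__(self):
        # worker id -> {"code_path": ..., "stack": ..., "heap": ...}
        self._states = {}

    def register(self, worker_id, code_path):
        # Only the location of the code segment is recorded, not the code itself.
        self._states[worker_id] = {"code_path": code_path, "stack": [], "heap": {}}

    def on_checkpoint(self, worker_id, stack, heap):
        # Assumption: each checkpoint carries a full snapshot that replaces
        # the previous one. Copies are taken so later changes made by the
        # worker do not alter the saved state.
        entry = self._states[worker_id]
        entry["stack"] = copy.deepcopy(stack)
        entry["heap"] = copy.deepcopy(heap)

    def state_for(self, worker_id):
        return self._states[worker_id]


backup = GenericBackup()
backup.register("teller", "/hypothetical/path/teller.obj")
backup.register("router", "/hypothetical/path/router.obj")

worker_heap = {"balance": 100}
backup.on_checkpoint("teller", stack=["transfer"], heap=worker_heap)
worker_heap["balance"] = 75  # uncheckpointed change made after the snapshot
```

In this sketch a single backup object serves both workers, and the state it would use at takeover is the last checkpointed one, not the worker's later uncheckpointed change.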
In one embodiment, a method for recovery of an original working process upon failure is provided. State information associated with the original working process is obtained. A copy of a code segment associated with the original working process is obtained and loaded into memory. The code segment is caused to execute as an active working process, using the state information.
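The recovery steps above can be sketched as follows. This is a simplified illustration, not the claimed method itself: Python source text stands in for an object-file code segment, and the `recover` function, the `resume` entry point, and the shape of the state dictionary are hypothetical.

```python
import os
import tempfile


def recover(state, code_path):
    """Load the code segment from disk and run it using checkpointed state."""
    # Obtain a copy of the code segment and load it into memory.
    with open(code_path, "r", encoding="utf-8") as f:
        source = f.read()
    namespace = {}
    exec(compile(source, code_path, "exec"), namespace)
    # Cause the code to execute as the active working process, using the state.
    return namespace["resume"](state)


# Hypothetical code segment for a working process, kept on disk until needed.
WORKER_SOURCE = '''
def resume(state):
    # Continue a running total from the last checkpointed position.
    total = state["total"]
    for amount in state["pending"]:
        total += amount
    return total
'''


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "worker_code.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(WORKER_SOURCE)
        # State information obtained from the backup for the failed process.
        checkpointed_state = {"total": 100, "pending": [25, -10]}
        return recover(checkpointed_state, path)


result = demo()
```

Here the code is read from disk only at takeover, so no copy of it occupies memory while the original working process is healthy.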
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.