In a distributed computing system, processes from heavily loaded machines are often migrated to lightly loaded machines in order to use the computing resources more efficiently. Such load sharing is especially useful when long-running non-interactive processes are initiated at a given machine. These long-running processes can consume a large number of central processing unit (CPU) cycles for an extended period of time, and thus slow down shorter interactive processes that are submitted to that machine. If the system includes idle remote machines accessible over a network, the long-running processes can be migrated from their local host machines to one or more remote machines. A given migrated process may subsequently be preempted, transferred to another remote machine, and restarted from a checkpointed state if, for example, an interactive process arrives at the first remote machine or a failure occurs in the first remote machine. If an interactive user is not to be slowed down by remote non-interactive processes that he or she does not own, the arrival of an interactive process generally necessitates the termination of all remote non-interactive processes on that machine. This type of preemption and transfer can continue until the migrated process is completed.
A number of conventional systems implement load sharing based on the above-described process migration techniques. The operation of these conventional systems can be generalized as follows. When a host machine decides to migrate a non-interactive process, a single execution of this process is started on a designated remote machine. This process is then periodically checkpointed. If an interactive process arrives at the remote machine, or a process or machine failure occurs on that machine, the migrated process is terminated and restarted from the previous checkpoint on another remote machine, or on the same remote machine in the case of a process failure. This technique thus involves a "rollback" to the previous checkpoint: all computation performed since that checkpoint is lost, and as a result a considerable amount of computation time may be wasted. Conventional periodic checkpointing techniques therefore fail to provide optimal performance in terms of minimizing the expected completion time of a migrated process.
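The cost of rollback under periodic checkpointing can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a description of any particular conventional system: the function name, its parameters, the use of abstract work units that each take one unit of wall-clock time, and the neglect of checkpointing overhead are all hypothetical choices made for illustration.

```python
def run_with_periodic_checkpoints(total_work, interval, interruptions,
                                  restart_cost=0.0):
    """Simulate a migrated process under periodic checkpointing with rollback.

    Assumptions (illustrative only): one unit of work takes one unit of
    wall-clock time, a checkpoint is taken after every `interval` units of
    work at no cost, and each interruption (an interactive arrival or a
    failure) forces a restart from the most recent checkpoint.

    Returns (completion_time, wasted_work).
    """
    clock = 0.0   # wall-clock time at which the current execution started
    saved = 0.0   # progress recorded in the most recent checkpoint
    wasted = 0.0  # computation lost to rollbacks

    for t in sorted(interruptions):
        if t < clock:
            # Interruption falls within a restart delay; nothing is running.
            continue
        progress = saved + (t - clock)
        if progress >= total_work:
            # The process already finished before this interruption.
            break
        # Roll back to the last checkpoint taken at or before `progress`.
        new_saved = (progress // interval) * interval
        wasted += progress - new_saved
        saved = new_saved
        clock = t + restart_cost

    completion_time = clock + (total_work - saved)
    return completion_time, wasted


if __name__ == "__main__":
    # 100 units of work, checkpoint every 10 units, interrupted at t=25 and t=47.
    print(run_with_periodic_checkpoints(100, 10, [25, 47]))
```

In this sketch every interruption discards whatever progress was made since the last checkpoint, so the completion time grows by the rolled-back work plus any restart delay. A shorter checkpoint interval reduces the loss per interruption, but in a real system it would increase checkpointing overhead, which this sketch deliberately ignores.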
On-demand checkpointing has also been suggested as a way to overcome the above-noted problems associated with conventional periodic checkpointing. However, proposed implementations of on-demand checkpointing have heretofore generally been viewed as complete replacements for periodic checkpointing.