A computer in operation includes hardware, software, and data. The hardware typically includes a processor, memory, storage, and I/O (input/output) devices coupled together by a bus. The software typically includes an operating system and applications. The applications perform useful work on the data for a user or users. The operating system provides an interface between the applications and the hardware. The operating system performs two primary functions. First, it allocates resources to the applications. The resources include hardware resources, such as processor time, memory space, and I/O devices, and software resources, some of which enable the hardware resources to perform tasks. Second, it controls execution of the applications to ensure proper operation of the computer.
Often, the software is conceptually divided into a user level, where the applications reside and which the users access, and a kernel level, where the operating system resides and which is accessed by system calls. Within an operating computer, a unit of work is referred to as a process. A process is computer code and data in execution. At any given time, a process may be executing, ready to execute, or waiting for an event to occur. The system calls provide an interface between the processes and the operating system.
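The three process conditions described above can be illustrated with a minimal sketch. The names ProcState, TRANSITIONS, and can_move are hypothetical, and the allowed transitions are a common simplification assumed here for illustration, not a description of any particular operating system.

```python
from enum import Enum


class ProcState(Enum):
    """The three conditions a process may be in, per the text above."""
    RUNNING = "running"   # actually executing on a processor
    READY = "ready"       # able to execute, waiting for processor time
    WAITING = "waiting"   # blocked until some event occurs


# Assumed transition table for this simplified model.
TRANSITIONS = {
    ProcState.READY: {ProcState.RUNNING},                       # scheduled
    ProcState.RUNNING: {ProcState.READY, ProcState.WAITING},    # preempted or blocked
    ProcState.WAITING: {ProcState.READY},                       # event occurred
}


def can_move(src: ProcState, dst: ProcState) -> bool:
    """Return True if the model allows a process to go from src to dst."""
    return dst in TRANSITIONS[src]
```

In this model a waiting process becomes ready when its event occurs, and only the scheduler moves a ready process into execution.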
Checkpointing is a technique employed on some computers where processes take significant time to execute. By occasionally performing a checkpoint of the processes and the resources assigned to them, the processes can be restarted from an intermediate computational state in the event of a system failure. Migration is a technique in which running processes are checkpointed and then restarted on another computer. Migration allows some processes on a heavily used computer to be moved to a lightly used computer. Checkpointing, restart, and migration have been implemented in a number of ways.
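The idea of restarting from an intermediate computational state can be sketched at the application level. The example below is an illustration only: the function run_sum, its parameters, and the fail_at failure injection are hypothetical, and real checkpointing systems capture far more state (memory, open files, and other resources) than a single loop counter and total.

```python
import os
import pickle


def run_sum(n, ckpt_path, every=100, fail_at=None):
    """Sum the integers 0..n-1, saving (i, total) every `every` steps.

    If ckpt_path already exists, the computation resumes from the saved
    state instead of starting over. `fail_at` simulates a system failure.
    """
    i, total = 0, 0
    if os.path.exists(ckpt_path):
        # Restart: restore the most recent intermediate state.
        with open(ckpt_path, "rb") as f:
            i, total = pickle.load(f)

    while i < n:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated system failure")
        total += i
        i += 1
        if i % every == 0:
            # Checkpoint: write to a temporary file, then rename, so a
            # failure during the write cannot corrupt the old checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump((i, total), f)
            os.replace(tmp_path, ckpt_path)
    return total
```

If a first call fails partway through, a second call with the same ckpt_path resumes from the last saved step rather than from zero. Copying the checkpoint file to another computer and calling run_sum there would be a crude analogue of migration.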
In "The Design and Implementation of Zap: A System for Migrating Computing Environments," Proc. OSDI 2002, Osman et al. teach a technique of adding a loadable kernel module to a standard operating system to provide checkpoint, restart, and migration of processes implemented by existing applications. The loadable kernel module divides the application level into process domains and provides virtualization of resources within each process domain. Such virtualization of resources includes virtual process identifiers and virtualized network addresses. Processes within one process domain are prevented from interacting with processes in another process domain using inter-process communication techniques. Instead, processes within different process domains interact using network communications and shared files set up for communication between different computers.
Checkpointing in the technique taught by Osman et al. records the processes in a process domain as well as the state of the resources used by the processes. Because resources in the process domain are virtualized, restart or migration of a process domain includes restoring resource identifications to a virtualized identity that the resources had at the most recent checkpoint.
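The role of virtualized identifiers in restart and migration can be illustrated with a toy model. This sketch is not the implementation taught by Osman et al.; the class ProcessDomain and its methods are hypothetical names, and only process identifiers are modeled. The point is that processes see stable virtual identifiers, while the underlying real identifiers may differ after restart.

```python
class ProcessDomain:
    """Toy model: maps stable virtual PIDs to host-specific real PIDs."""

    def __init__(self):
        self._virt_to_real = {}
        self._next_virt = 1

    def register(self, real_pid):
        """Assign the next virtual PID to a process with the given real PID."""
        vpid = self._next_virt
        self._next_virt += 1
        self._virt_to_real[vpid] = real_pid
        return vpid

    def real_pid(self, vpid):
        """Translate a virtual PID to the real PID on the current host."""
        return self._virt_to_real[vpid]

    def checkpoint(self):
        """Record only virtual identities; real PIDs are host specific."""
        return {"vpids": sorted(self._virt_to_real),
                "next_virt": self._next_virt}

    @classmethod
    def restart(cls, image, new_real_pids):
        """Rebuild the domain, binding saved virtual PIDs to new real PIDs."""
        if len(new_real_pids) != len(image["vpids"]):
            raise ValueError("one real PID is required per saved virtual PID")
        dom = cls()
        dom._next_virt = image["next_virt"]
        dom._virt_to_real = dict(zip(image["vpids"], new_real_pids))
        return dom
```

For example, a process registered with real PID 4311 might receive virtual PID 1. After a checkpoint and a restart on another computer where its real PID is 907, the process is still addressed as virtual PID 1 within the domain.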
While the checkpoint, restart, and migration techniques taught by Osman et al. show promise, several areas could be improved. In particular, communication state that exists outside of the process domain at checkpoint may need to be restored.