1. Field of the Invention
This application relates to an arrangement of a computer system, in particular, to a system and a method for acquiring, storing and using data concerning the state of hardware and software components within the computer system.
2. Description of the Related Art
Modern computers xe2x80x9ccrashxe2x80x9d with irritating frequency, with much work lost or recovered only with time-consuming effort. Sometimes, crashes or other errors are expected, for example, when designing new software or debugging an existing program. In such cases, and even when first turning the computer on, time is also lost waiting for computers to xe2x80x9cbootxe2x80x9d or xe2x80x9creboot.xe2x80x9d At other times, when problems occur for an ordinary user of a commercial application, even more time is often lost when the frustrated user must try to explain orally what has happened to a technician located far away in a customer service department. These are just a few of many possible examples of situations when information about the state of the computer system is either desirable, for example, when debugging a new program, or necessary, for example, when the computer is to reboot and automatically load previously running applications along with the data they were processing when exited.
One known attempt to ensure the ability to analyze and reconstruct the state of a physical memory, disk or data base is based on the concept of a xe2x80x9ctransaction,xe2x80x9d which involves on-going tracking of updates to at least one region of storage. In this context, a transaction is a collection of updates that are bundled together so that they are atomic that is, either all of the updates occur, or none of them occur. The idea of transactions is typically applied to databases, where a series of updates to different tables need to occur simultaneously.
A transaction proceeds as follows: A begin command from the operating system or an application marks the beginning of the series of updates that make up the transaction. After the updates complete, a commit command marks the end of the transaction and the updates become permanent. If an error occurs during one of the updates that are part of the transaction, a rollback command is used to undo any updates in the transaction that may have completed.
In the prior art, this use of the concept of transactions is commonly implemented in database systems. Recently, transactions have been extended to apply to logical disks (also referred to as virtual disks), which are a software construct that emulate physical disks. One example of this solution, in the context of a parallel or distributed processing arrangement, is described in U.S. Pat. No. 5,634,096 (Baylor, et al., May 27, 1997, xe2x80x9cUsing virtual disks for disk system checkpointingxe2x80x9d), which discloses a scheme for storing data on disks in such a way that a xe2x80x9ccheckpointxe2x80x9d is taken across several disks connected to different processors. This checkpoint is then used to restore the entire disk system to a known state after one or more of the disks or processors fails.
Yet another solution involving virtual disks is described in xe2x80x9cThe Logical Disk: A New Approach to Improving File Systems,xe2x80x9d by de Jonge, Kaashoek, and Hsieh, in Proceedings of the 141h ACM Symposium on Operating System Principles, pp. 15-28, December 1993. In this paper, the term xe2x80x9cAtomic Recovery Unitxe2x80x9d is used to describe transactions to the logical disk.
The implementation of a logical disk requires the interception of requests to the physical disk, and transforming them into operations on a logical disk. Once this has been accomplished, it is possible to keep a log of all of the updates to the logical disk and defer the update so that the original data is not overwritten. When the updates are kept in a log in this fashion, then a rollback can be accomplished by discarding the updates in the log for a particular transaction. A commit can be accomplished by retaining these updates in the log, and eventually applying them to the logical disk.
A similar concept has been proposed in xe2x80x9cPetal: Distributed Virtual Disks,xe2x80x9d by Lee and Thekkath, in Proc. 1 xe2x80x9cIntl. Conf. On Architectural Support for Programming Languages and Operating Systems,xe2x80x9d pp. 84-92, October 1996. The Petal virtual disk supports the ability to take snapshots of the virtual disk, using techniques known as xe2x80x9ccopy-on-write.xe2x80x9d Copy-on-write is a common technique that allows copies to be created quickly, using a table of pointers to the actual data, and only copying the data when it is modified by a user program.
In Petal, the virtual disk itself is implemented as a table of pointers, and the snapshot (equivalent to a xe2x80x9ccheckpointxe2x80x9d) is implemented by including an identifier (called an epoch number) in this table. When a snapshot is taken, the current epoch number is assigned to the snapshot. The epoch number is then incremented, and all subsequent updates to the virtual disk belong to this new epoch number. When a block of the disk is next updated, there will be no copy at the current epoch number, so a copy of the block will be created. In short, as the term xe2x80x9ccopy-on-writexe2x80x9d implies, a copy is made only when a disk block is written to. The original data is still available, under the epoch number of the snapshot.
Both the logging technique and the snapshot technique allow the implementation of transactions on a logical disk. In both cases, there are two copies of the modified disk block: the original version and the updated version. By restoring the state of the logical disk to point to the original version of all the disk blocks that were modified during the transaction, the transaction can be rolled back, that is, the state of the disk at the beginning of the transaction can be restored.
The concepts of transactions on virtual disks and snapshots of virtual disks have a number of limitations. The first is that they are useful only in the context of restoring the state of the disk: These systems provide no way to recover from, for example, failures caused by errors in a peripheral device.
Another limitation is that, during the operation of a typical computer system, the state of the disk is not complete: Modern operating systems employ disk caches that contain copies of data from the disk, as well as data that needs to be written to the disk. Applications also buffer data, so that even the operating system itself lacks a complete view of all the data entered by a user of the computer system. Snapshots of the disk state taken at an arbitrary point are only as consistent as the disk would be if the computer system were to crash at that point. On the other hand, any data that is present in the cache or in application memory, but that is not yet written to disk, is lost.
If snapshots of the disk state are taken only at points when the operating system is shut down, then the disk is in a consistent state, and no data is lost. However, this represents a significant limitation on the concept of transactions: Before a transaction can begin or end, all applications must be closed and the operating system must be shut down. This makes the snapshot technique inadequate to restore the full state of the disk when the system or an application xe2x80x9ccrashes,xe2x80x9d that is, when an application terminates other than as a result of a prescribed shut-down routine and whose execution cannot proceed. Alternatively, the application or operating system must explicitly issue commands that cause the buffered or cached data to be written back to the disk. In short, the reality of modern systems does not always conform to the xe2x80x9ccleanxe2x80x9d assumptions of the snapshot model, or they require the explicit coordination of application or operating system software.
The technique of taking snapshots (also known as xe2x80x9ccheckpointingxe2x80x9d) has also been used not only for virtual disks, but also for other subsystems such as file systems. Moreover, checkpointing has also been proposed for applications, and, in certain very restricted senses and cases, for systems as a whole. Examples of each will now be given.
One example of checkpointing of file systems is disclosed in xe2x80x9cDeciding when to forget in the Elephant file system,xe2x80x9d D. Santry, et al., Proceedings of the 17th ACM Symposium on Operating Systems Principles, Charleston, S.C. This xe2x80x9cElephant File Systemxe2x80x9d uses copy-on-write techniques, as well as per-file characteristics to implement checkpointing of the file system, albeit only on a file-by-file basis.
Other checkpointing techniques for file systems are described in xe2x80x9cFile system design for a file server appliance,xe2x80x9d D. Hitz, et al., Proceedings of the 1994 Winter USENIX Technical Conference, pages 235-245, San Francisco, Calif., January 1994; and xe2x80x9cScale and performance in a distributed file system,xe2x80x9d J. Howard, et al., ACM Transactions on Computer Systems, 6(1):51-81, February, 1988. In both of these systems, copy-on-write techniques are used to create whole file system checkpoints.
Many different proposals have also been put forward for checkpointing systems in certain restricted situations. One such proposal for the system known as KeyKOS is described, for example, in xe2x80x9cThe Checkpoint Mechanism in KeyKOS,xe2x80x9d C. Landau, Proceedings of the Second International Workshop on Object Orientation in Operating Systems, September 1992. The KeyKOS system, which operates as a microkernel-based operating system (OS), treats an entire system (from a software perspective) as a collection of objects and periodically takes checkpoints of all the objects. After a crash, the objects can be restored and the system resumed. One shortcoming of the KeyKOS system is that it requires new system software to be written, in particular, new application program interfaces (API""s). Yet another disadvantage of KeyKOS is that, after a crash, the OS still needs to go through a boot-up process before restoring the objects.
Still another known system-checkpointing technique is described in xe2x80x9cEROS: a fast capability system,xe2x80x9d J. Shapiro, et al., Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP ""99), December 1999, Charleston, S.C. Like KeyKOS, this EROS system is an object-oriented operating system with objects that are made persistent by checkpointing them. This checkpointing requires that all state resides in special objects called xe2x80x9cpagesxe2x80x9d and xe2x80x9cnodes,xe2x80x9d and that all kernel (OS) operations are atomic. Like KeyKOS, the system requires a new API, that is, new software, to be written, and requires O/S coordination. In EROS, periodic copies (checkpoints) are made of all objects, which are saved using copy-on-write techniques. Also like KeyKOS, the EROS system requires an O/S reboot after a crash.
As its title implies, U.S. Pat. No. 5,715,464 (Crump, et al., Feb. 3, 1998, xe2x80x9cComputer system having suspend once resume many sessionsxe2x80x9d) describes a computer system that has suspend once resume many (SORM) sessions. This SORM arrangement operates in a manner similar to the way in which existing portable computers are able to xe2x80x9csuspendxe2x80x9d their operation, for example, when the lid is closed, and then resume operation when reactivated. In the SORM system described in the Crump ""464 patent, however, the suspended image is preserved after resuming and thus may be restored multiple times, although subject to the very restrictive condition that the suspended image may no longer be valid after the next disk access in a resumed system. Moreover, the disclosed system-checkpointing solution describes possibility of keeping multiple suspended images, each for a different operating system, so that one can alternate between running the suspended operating systems.
Yet another system with features similar to the suspend-to-disk features of a portable computer is disclosed in U.S. Pat. No. 5,758,174 (Crump, et al., May 26, 1998, xe2x80x9cComputer system having a plurality of stored system capability states from which to resumexe2x80x9d). In this system, multiple suspended images may be kept and the user may resume from any one of them.
In both the Crump ""464 and ""174 systems, the operating system (OS) and application software must participate in the suspension and must go through a shutdown and a wake-up phase. In particular, these known systems require software executing within the operating-system, such as an Advanced Power Management (APM) driver, and applications/subsystems to register with the APM driver.
Still another system of this type is described in U.S. Pat. No. 5,386,552 (Garney, et al., Jan. 31, 1995, xe2x80x9cPreservation of a computer system processing state in a mass storagexe2x80x9d). In this system, the contents of system registers and system memory are saved in a mass storage device upon the occurrence of a triggering event, such as during power-off or when the system is to enter a low-power mode. The system then enters a suspend state. Once processing is resumed, the contents of a previously saved processing state are read in and control is returned to the previously running application program. This system requires two separate modulesxe2x80x94a special interrupt handler and a system management modulexe2x80x94to handle saving different partitionsxe2x80x94isolated and non-isolatedxe2x80x94of the memory.
As in other suspend-and-resume systems, in the Garney system, the evolution of the computer system state is always moving forward in a linear trajectory. In other words, once the system is resumed, there is no way to go back to the previously suspended state. This is in part because the contents of the disk, which are not saved when the system enters the suspend state, may be freely modified after resumingxe2x80x94any post-resume modification prevents resuming again from the previously saved state. Thus, is not possible to resume multiple times from a saved image. It is also not possible to save the state, continue execution, and then resume later from the saved state.
The Garney system also illustrates another common disadvantage of existing arrangements that provide for saving at least some part of the system state: It requires that software within the system itself must participate in saving the system state. Thus, in order to save the partial state in the Garney system, the additional system software needs to cause the processor to go into a system management interrupt state so that it can access a system management memory area. The processor must also be in the system management interrupt state in order to ensure that a critical part of the save routine will not be interrupted by a hardware interrupt.
One known system for checkpointing applications is the xe2x80x9cCondorxe2x80x9d distributed processing system, which is described in xe2x80x9cCheckpoint and Migration of UNIX Processes in the Condor Distributed Processing System,xe2x80x9d M. Litzkow, et al., University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997; and xe2x80x9cSupporting Checkpointing and Process Migration Outside the UNIX Kernel,xe2x80x9d M. Litzkow, et al., Proceedings of the 1994 Winter USENIX Technical Conference, San Francisco, Calif., January 1992. The Condor system checkpoints the processes of running applications, and can migrate them to other machines as long as these also are running Condor. Only the application state is checkpointed, however, and the applications themselves must participate in the checkpointing by making calls to a checkpoint library.
All of the known systems and methods mentioned above suffer from one or more of the following disadvantages:
1) They save only part of the entire system state; as such, they cannot ensure complete restoration of the system state sufficient to guarantee that all applications will be able to continue exactly as they would have when the saved state is restored.
2) They are not able to generate checkpoints and save the state of the system at arbitrary points, or at multiple points. The systems will therefore not correctly save the partial state except when processing is interrupted at specific points or under specific conditions. This implies, of course, that there will be circumstances when the state cannot be saved at all. This means, in turn that such systems cannot be used for such operations as full-state, step-by-step debugging of applications. In many cases, this limitation is caused by a need for synchronization of the partial state-saving procedure with applications, or a need to wait for some other internal processxe2x80x94such as a shut down of some sub-systemxe2x80x94to be completed before saving the partial state.
3) They require specialized system software such as special API""s or operating systems. Alternatively, they assume and work only for particular operating systems and hardware architectures. They are therefore not beneficial to the most common usersxe2x80x94those who need to run off-the-shelf applications using an off-the-shelf operating system. An additional consequence of this is that the checkpoints are not portable between different systems.
4) They need to flush disk caches.
What is needed is some way to overcome these disadvantages of the prior art, and in particular, to extract and restore the entire state of the computer system as a whole, not just of some portion of the memory. This then would enable complete restoration of the system to any point in its processing without requiring any application or operating system intervention, or any specialized or particular system software (such as API""s and OS""s) or hardware architecture. This invention provides a system and method that accomplishes this, and it does so in a way that makes possible even other unique features, such as the ability for one or even multiple users to run, evaluate, test, restart, and duplicate a processing stream not only from the same point, but also from different points. The invention accomplishes this, moreover, in a manner that allows checkpointing the entire state of the system in a way that allows state information to be portable between different hardware platforms and system software configurations.
The preferred embodiment of the invention provides a computer system that has a hardware processor; physical system storage; a virtual machine monitor functionally connected to the hardware processor and physical system storage; and at least one virtual machine that is functionally connected to the virtual machine monitor.
Each virtual machine includes: a) a virtual processor that has a plurality of registers and executes a sequence of virtual processor instructions, virtual processor instructions being conveyed by the virtual machine monitor to the hardware processor, physical execution of the virtual processor instructions being carried out by the hardware processor; b) a virtual memory; c) a virtual operating system; d) at least one virtual peripheral device having a plurality of peripheral settings and peripheral device data; and e) at least one application program executing on the virtual machine.
At the completion of execution of each processor instruction by the virtual processor, the virtual machine has a total state that includes state information of the virtual processor, of the virtual memory, of the virtual operating system, of each application program, and of each virtual peripheral device. The virtual machine monitor is then used as a checkpointing mechanism a) for interrupting the virtual processor; b) for sensing a checkpoint request for the virtual machine; and c) for generating at least one checkpoint, each checkpoint comprising a list of the total state of the virtual machine corresponding to the respective checkpoint request, at the completion of execution of a respective virtual processor instruction.
In the preferred embodiment of the invention, checkpointing mechanism is further provided a) for storing each generated checkpoint; b) for sensing a state restoration request corresponding to one of the checkpoints; and c) upon sensing the state restoration request, for restoring into the virtual machine the total state corresponding to the checkpoint that in turn corresponds to the state restoration request, the virtual machine thereby resuming execution from the restored total state, that is, of the sequence of virtual processor instructions following the virtual processor instruction after whose completion the checkpoint was generated, identically as it would have if the virtual processor had not been interrupted.
The physical system storage preferably includes a checkpoint storage portion accessible only by the virtual machine monitor. Each generated checkpoint is then stored in the checkpoint storage portion. Stored checkpoints thereby remain unchanged by other access to the physical system storage. In this case, the total state of the virtual machine excludes the contents of the corresponding checkpoint storage portion.
The invention allows the use of a plurality of virtual machines. In this case, one of the virtual machines may serve as a source virtual machine, for which at least one checkpoint is generated, and the checkpointing mechanism is then further provided for loading into at least one secondary virtual machine, which is different from the source virtual machine, the total state of the source virtual machine corresponding to one of the generated checkpoints. The different virtual machines may also be installed on different virtual machine monitors, and the different virtual machine monitors may even be installed on different hardware platforms.
If the source virtual machine is located within a source computer, the secondary virtual machine may be located within a secondary computer that contains a secondary virtual machine that is structurally substantially identical to the source virtual machine. The source and secondary computers may thereby have separate hardware processors and physical system storage. The system according to the invention may then include a transmission medium via which the checkpoint is transferred from the source virtual machine to the secondary virtual machine. Examples of the transmission medium include a portable memory device and a network to which the source and secondary computers are connected.
One advantage of having distinct but structurally identical virtual machines running on a virtual machine monitor is that they may then be independent of an underlying system hardware platform that includes the hardware processor and the physical system storage. The total state of any source virtual machine is thus exportable to any other secondary virtual machine.
The computer system according to the invention may include a plurality of peripheral system hardware devices. In this case, in the preferred embodiment, for each of the plurality of peripheral system hardware devices, the virtual machine monitor preferably includes a respective device emulation module corresponding to a predetermined emulation of a representative, virtualized hardware device corresponding to the peripheral system hardware device. The device emulation modules then form a device interface between the virtual machines and the peripheral system hardware devices. The virtual machine monitor then preferably further includes a conversion module converting device control signals and data from any virtual machine in a format of the respective representative virtualized hardware device into corresponding signals and data into a format of the corresponding peripheral system hardware device, and vice versa. Each virtual machine is thereby made independent of the underlying system hardware platform, which includes the hardware processor, the physical system storage, and the peripheral system hardware devices. The total state of any source virtual machine then is exportable to any other secondary virtual machine, regardless of what physical peripherals might be included in a particular system.
In an embodiment of the invention that does not rely on the use of virtual machines and a virtual machine monitor there is at least one application program running on system software such as an operating system. In this embodiment, the mechanism for generating checkpoints is a driver installed in the operating system.
According to the invention, checkpoints may be restored a plurality of times. In such case, the corresponding virtual machine thereby resumes execution multiple times from an identical total state.
Whenever multiple checkpoints are generated and stored, each subsequently generated checkpoint is generated without changing any earlier generated checkpoint.
Also in the preferred embodiment of the invention, any arbitrary one of the stored checkpoints may be restored into an arbitrary one of the virtual machines. Moreover, the same stored checkpoint may be loaded concurrently into a plurality of virtual machines; each virtual machine is then able to proceed from the loaded checkpoint independently of the other virtual machines.
The checkpointing procedure does not require modification or long suspension of the operation of a virtual machine. Rather, according to the invention, unmodified execution of the virtual machine may be resumed immediately after the step of generating each checkpoint.
The invention also provides a way to reduce how much storage space is needed for storing the state information contained in checkpoints. If this feature is implemented, then the steps of sensing each checkpoint request and generating a corresponding checkpoint may then include the following sub-steps: a) generating an initial checkpoint including the total state of the virtual machine; and b) for at least one subsequent checkpoint after the initial checkpoint, generating the subsequent checkpoint as a list of state changes relative to an immediately preceding generated checkpoint as well as link information to the immediately preceding generated checkpoint.
According to another aspect of the preferred embodiment of the invention, checkpoints may also be deleted. In this case, at least one checkpoint is designated for deletion. If it is the initial checkpoint is the one designated for deletion, all the total state information in the initial checkpoint is incorporated into each checkpoint immediately following the designated checkpoints. If the initial checkpoint is not one designated for deletion then all state change information in the checkpoints designated for deletion is incorporated into each checkpoint immediately following the designated checkpoints. (Note that a checkpoint may have more than one immediate successor, for example, when a checkpoint has been loaded into more than one other virtual machine, or when more than one parallel processing branch is established in a single virtual machine, all branches starting from the same checkpoint.) All designated checkpoints are then deleted.
According to another aspect of the invention, any virtual machine may be operated as a transactional computer system. In this case, each checkpoint marks the beginning of a transaction, or a portion of a larger transaction. In this embodiment, one of the subsequent checkpoints is designated as a commit checkpoint. In order to commit the checkpoint, all the total state information in the initial checkpoint as well as all state change information in all checkpoints between the initial checkpoint and the commit checkpoint, is incorporated into the commit checkpoint, and the commit checkpoint then becomes a new initial checkpoint.
In embodiments of the invention in which the virtual machine includes at least one virtual peripheral device, the virtual memory is preferably configured to consist of virtual volatile memory, including portions for a virtual device state of each virtual peripheral device, and virtual non-volatile storage. Upon system power off, the total state which forms checkpoints is then the total definable state, consisting of the contents of the virtual non-volatile storage. Generation a checkpoint as a list of the total state of the virtual machine then comprises including as the total state in the checkpoint only the total definable state.