1. Field of the Invention
This application relates to an arrangement of a computer system, in particular, to a system and a method for acquiring, storing and using data concerning the state of hardware and software components within the computer system.
2. Description of the Related Art
Modern computers “crash” with irritating frequency, with much work lost or recovered only with time-consuming effort. Sometimes, crashes or other errors are expected, for example, when designing new software or debugging an existing program. In such cases, and even when first turning the computer on, time is also lost waiting for computers to “boot” or “reboot.” At other times, when problems occur for an ordinary user of a commercial application, even more time is often lost when the frustrated user must try to explain orally what has happened to a technician located far away in a customer service department. These are just a few of many possible examples of situations when information about the state of the computer system is either desirable, for example, when debugging a new program, or necessary, for example, when the computer is to reboot and automatically load previously running applications along with the data they were processing when exited.
One known attempt to ensure the ability to analyze and reconstruct the state of a physical memory, disk or data base is based on the concept of a “transaction,” which involves on-going tracking of updates to at least one region of storage. In this context, a transaction is a collection of updates that are bundled together so that they are atomic; that is, either all of the updates occur, or none of them occur. The idea of transactions is typically applied to databases, where a series of updates to different tables needs to occur simultaneously.
A transaction proceeds as follows: A begin command from the operating system or an application marks the beginning of the series of updates that make up the transaction. After the updates complete, a commit command marks the end of the transaction and the updates become permanent. If an error occurs during one of the updates that are part of the transaction, a rollback command is used to undo any updates in the transaction that may have completed.
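The begin, commit and rollback sequence described above can be illustrated with a minimal sketch. The class name `TransactionalStore`, its methods, and the use of an in-memory dictionary as the "region of storage" are assumptions made purely for illustration; they are not taken from any of the cited systems.

```python
_MISSING = object()  # marker for "key did not exist before the transaction"


class TransactionalStore:
    """Toy key-value store with begin/commit/rollback (illustrative only)."""

    def __init__(self):
        self.data = {}
        self._undo = None  # None means no transaction is open

    def begin(self):
        if self._undo is not None:
            raise RuntimeError("a transaction is already open")
        self._undo = {}

    def update(self, key, value):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        # Record the prior value only the first time a key is touched,
        # so that rollback restores the state as of begin().
        if key not in self._undo:
            self._undo[key] = self.data.get(key, _MISSING)
        self.data[key] = value

    def commit(self):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        self._undo = None  # the updates become permanent

    def rollback(self):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        for key, old in self._undo.items():
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self._undo = None


store = TransactionalStore()
store.begin()
store.update("balance", 100)
store.commit()

store.begin()
store.update("balance", 50)
store.update("pending", True)
store.rollback()  # an error occurred, so both updates are undone
```

After the rollback, the store again holds only the value written by the committed transaction, which is the all-or-nothing behavior that the term "atomic" refers to.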
Transactional Disks
In the prior art, this use of the concept of transactions is commonly implemented in database systems. Recently, transactions have been extended to apply to logical disks (also referred to as virtual disks), which are software constructs that emulate physical disks. One example of this solution, in the context of a parallel or distributed processing arrangement, is described in U.S. Pat. No. 5,634,096 (Baylor, et al., 27 May 1997, “Using virtual disks for disk system checkpointing”), which discloses a scheme for storing data on disks in such a way that a “checkpoint” is taken across several disks connected to different processors. This checkpoint is then used to restore the entire disk system to a known state after one or more of the disks or processors fails.
Yet another solution involving virtual disks is described in “The Logical Disk: A New Approach to Improving File Systems,” by de Jonge, Kaashoek, and Hsieh, in Proceedings of the 14th ACM Symposium on Operating System Principles, pp. 15-28, December 1993. In this paper, the term “Atomic Recovery Unit” is used to describe transactions to the logical disk.
The implementation of a logical disk requires intercepting requests to the physical disk and transforming them into operations on a logical disk. Once this has been accomplished, it is possible to keep a log of all of the updates to the logical disk and defer the update so that the original data is not overwritten. When the updates are kept in a log in this fashion, then a rollback can be accomplished by discarding the updates in the log for a particular transaction. A commit can be accomplished by retaining these updates in the log, and eventually applying them to the logical disk. A similar concept has been proposed in “Petal: Distributed Virtual Disks,” by Lee and Thekkath, in Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 84-92, October 1996. The Petal virtual disk supports the ability to take snapshots of the virtual disk, using techniques known as “copy-on-write.” Copy-on-write is a common technique that allows copies to be created quickly, using a table of pointers to the actual data, and only copying the data when it is modified by a user program.
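The logging approach can be sketched as follows. This is a simplified illustration, not the implementation of any cited system: the class `LoggedLogicalDisk`, its block size, and the choice to apply the log immediately on commit (rather than "eventually") are all assumptions.

```python
class LoggedLogicalDisk:
    """Toy logical disk that defers writes in a log (illustrative only)."""

    def __init__(self, num_blocks, block_size=4):
        # Stand-in for the physical disk: a list of fixed-size blocks.
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Pending updates of the current transaction: block number -> data.
        self.log = {}

    def write(self, block_no, data):
        # Intercepted write: recorded in the log, original block untouched.
        self.log[block_no] = data

    def read(self, block_no):
        # Reads see the logged update if there is one, else the original.
        return self.log.get(block_no, self.blocks[block_no])

    def rollback(self):
        # Discarding the log is enough, since nothing was overwritten.
        self.log.clear()

    def commit(self):
        # Simplification: apply the logged updates right away.
        for block_no, data in self.log.items():
            self.blocks[block_no] = data
        self.log.clear()


disk = LoggedLogicalDisk(num_blocks=2)
disk.write(0, b"abcd")
disk.rollback()          # block 0 still holds its original contents
disk.write(1, b"wxyz")
disk.commit()            # block 1 now holds the new data
```

Because the original block is never overwritten before commit, rollback needs no undo information; it only has to forget the log.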
In Petal, the virtual disk itself is implemented as a table of pointers, and the snapshot (equivalent to a “checkpoint”) is implemented by including an identifier (called an epoch number) in this table. When a snapshot is taken, the current epoch number is assigned to the snapshot. The epoch number is then incremented, and all subsequent updates to the virtual disk belong to this new epoch number. When a block of the disk is next updated, there will be no copy at the current epoch number, so a copy of the block will be created. In short, as the term “copy-on-write” implies, a copy is made only when a disk block is written to. The original data is still available, under the epoch number of the snapshot.
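The epoch-number scheme can be illustrated with a small sketch. It follows only the description given above and is not Petal's actual data structure or interface; the class `EpochDisk`, its methods, and the per-block dictionary of versions are hypothetical.

```python
class EpochDisk:
    """Toy virtual disk with epoch-based snapshots (illustrative only)."""

    def __init__(self):
        self.epoch = 0
        # block number -> {epoch number: data written during that epoch}
        self.versions = {}

    def write(self, block_no, data):
        # The first write to a block in a new epoch creates a new copy;
        # copies belonging to earlier epochs are left in place.
        self.versions.setdefault(block_no, {})[self.epoch] = data

    def snapshot(self):
        # The snapshot takes the current epoch number; later writes
        # belong to the incremented epoch.
        snap = self.epoch
        self.epoch += 1
        return snap

    def read(self, block_no, epoch=None):
        # Return the newest copy written at or before the given epoch.
        if epoch is None:
            epoch = self.epoch
        copies = self.versions.get(block_no, {})
        usable = [e for e in copies if e <= epoch]
        if not usable:
            return None
        return copies[max(usable)]


disk = EpochDisk()
disk.write(7, b"old")
snap = disk.snapshot()
disk.write(7, b"new")
current = disk.read(7)            # data of the current epoch
at_snapshot = disk.read(7, snap)  # original data, still available
```

Taking the snapshot copies no data at all; a block is duplicated only when it is next written, which is why the technique is inexpensive when few blocks change between snapshots.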
Both the logging technique and the snapshot technique allow the implementation of transactions on a logical disk. In both cases, there are two copies of the modified disk block: the original version and the updated version. By restoring the state of the logical disk to point to the original version of all the disk blocks that were modified during the transaction, the transaction can be rolled back, that is, the state of the disk at the beginning of the transaction can be restored.
The concepts of transactions on virtual disks and snapshots of virtual disks have a number of limitations. The first is that they are useful only in the context of restoring the state of the disk: These systems provide no way to recover from, for example, failures caused by errors in a peripheral device.
Another limitation is that, during the operation of a typical computer system, the state of the disk is not complete: Modern operating systems employ disk caches that contain copies of data from the disk, as well as data that needs to be written to the disk. Applications also buffer data, so that even the operating system itself lacks a complete view of all the data entered by a user of the computer system. Snapshots of the disk state taken at an arbitrary point are only as consistent as the disk would be if the computer system were to crash at that point. Moreover, any data that is present in the cache or in application memory, but that is not yet written to disk, is lost.
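The effect of caching on a disk snapshot can be shown with a short sketch. The `WriteBackCache` class and the use of a dictionary as the disk are assumptions chosen for illustration; real operating system caches are considerably more involved.

```python
class WriteBackCache:
    """Toy write-back disk cache (illustrative only)."""

    def __init__(self, disk):
        self.disk = disk   # dict standing in for the disk: block -> data
        self.dirty = {}    # data written by applications, not yet on disk

    def write(self, block_no, data):
        self.dirty[block_no] = data

    def read(self, block_no):
        if block_no in self.dirty:
            return self.dirty[block_no]
        return self.disk.get(block_no)

    def flush(self):
        # Explicit write-back of all cached data to the disk.
        self.disk.update(self.dirty)
        self.dirty.clear()


disk = {0: "version 1"}
cache = WriteBackCache(disk)
cache.write(0, "version 2")

snapshot_before_flush = dict(disk)  # misses the cached "version 2"
cache.flush()
snapshot_after_flush = dict(disk)   # now includes "version 2"
```

The first snapshot captures only what has reached the disk, so the newer data held in the cache is absent from it; only after an explicit flush does a disk snapshot reflect what the application wrote.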
If snapshots of the disk state are taken only at points when the operating system is shut down, then the disk is in a consistent state, and no data is lost. However, this represents a significant limitation on the concept of transactions: Before a transaction can begin or end, all applications must be closed and the operating system must be shut down. This makes the snapshot technique inadequate to restore the full state of the disk when the system or an application “crashes,” that is, when it terminates other than as a result of a prescribed shut-down routine and its execution cannot proceed. Alternatively, the application or operating system must explicitly issue commands that cause the buffered or cached data to be written back to the disk. In short, either the reality of modern systems does not conform to the “clean” assumptions of the snapshot model, or the snapshot techniques require the explicit coordination of application or operating system software.
The technique of taking snapshots (also known as “checkpointing”) has been used not only for virtual disks, but also for other subsystems such as file systems. Moreover, checkpointing has also been proposed for applications, and, in certain very restricted senses and cases, for systems as a whole. Examples of each will now be given.
File System Checkpointing
One example of checkpointing of file systems is disclosed in “Deciding when to forget in the Elephant file system,” D. Santry, et al., Proceedings of the 17th ACM Symposium on Operating Systems Principles, Charleston, S.C. This “Elephant File System” uses copy-on-write techniques, as well as per-file characteristics to implement checkpointing of the file system, albeit only on a file-by-file basis.
Other checkpointing techniques for file systems are described in “File system design for a file server appliance,” D. Hitz, et al., Proceedings of the 1994 Winter USENIX Technical Conference, pages 235-245, San Francisco, Calif., January 1994; and “Scale and performance in a distributed file system,” J. Howard, et al., ACM Transactions on Computer Systems, 6(1):51-81, February, 1988. In both of these systems, copy-on-write techniques are used to create whole file system checkpoints.
System Checkpointing
Many different proposals have also been put forward for checkpointing systems in certain restricted situations. One such proposal for the system known as KeyKOS is described, for example, in “The Checkpoint Mechanism in KeyKOS,” C. Landau, Proceedings of the Second International Workshop on Object Orientation in Operating Systems, September 1992. The KeyKOS system, which operates as a microkernel-based operating system (OS), treats an entire system (from a software perspective) as a collection of objects and periodically takes checkpoints of all the objects. After a crash, the objects can be restored and the system resumed. One shortcoming of the KeyKOS system is that it requires new system software to be written, in particular, new application program interfaces (API's). Yet another disadvantage of KeyKOS is that, after a crash, the OS still needs to go through a boot-up process before restoring the objects.
Still another known system-checkpointing technique is described in “EROS: a fast capability system,” J. Shapiro, et al., Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), December 1999, Charleston, S.C. Like KeyKOS, this EROS system is an object-oriented operating system with objects that are made persistent by checkpointing them. This checkpointing requires that all state resides in special objects called “pages” and “nodes,” and that all kernel (OS) operations are atomic. Like KeyKOS, the system requires a new API, that is, new software, to be written, and requires OS coordination. In EROS, periodic copies (checkpoints) are made of all objects, which are saved using copy-on-write techniques. Also like KeyKOS, the EROS system requires an OS reboot after a crash.
As its title implies, U.S. Pat. No. 5,715,464 (Crump, et al., 3 Feb. 1998, “Computer system having suspend once resume many sessions”) describes a computer system that has suspend once resume many (SORM) sessions. This SORM arrangement operates in a manner similar to the way in which existing portable computers are able to “suspend” their operation, for example, when the lid is closed, and then resume operation when reactivated. In the SORM system described in the Crump '464 patent, however, the suspended image is preserved after resuming and thus may be restored multiple times, although subject to the very restrictive condition that the suspended image may no longer be valid after the next disk access in a resumed system. Moreover, the disclosed system-checkpointing solution describes the possibility of keeping multiple suspended images, each for a different operating system, so that one can alternate between running the suspended operating systems.
Yet another system with features similar to the suspend-to-disk features of a portable computer is disclosed in U.S. Pat. No. 5,758,174 (Crump, et al., 26 May 1998, “Computer system having a plurality of stored system capability states from which to resume”). In this system, multiple suspended images may be kept and the user may resume from any one of them.
In both the Crump '464 and '174 systems, the operating system (OS) and application software must participate in the suspension and must go through a shutdown and a wake-up phase. In particular, these known systems require software executing within the operating system, such as an Advanced Power Management (APM) driver, and applications/subsystems to register with the APM driver. Furthermore, each suspended image must belong to a different OS, or instance of an OS, since the image does not include the state of the disk at the time the system was suspended. Resuming an OS will thus alter the contents of the disk associated with that OS at the next occurrence of a disk write, causing any suspended image associated with that OS to be inconsistent with the state of the disk. Another limitation is that neither system employs any form of copy-on-write techniques to reduce the amount of saved state.
Still another system of this type is described in U.S. Pat. No. 5,386,552 (Garney, et al., 31 Jan. 1995, “Preservation of a computer system processing state in a mass storage”). In this system, the contents of system registers and system memory are saved in a mass storage device upon the occurrence of a triggering event, such as during power-off or when the system is to enter a low-power mode. The system then enters a suspend state. Once processing is resumed, the contents of a previously saved processing state are read in and control is returned to the previously running application program. This system requires two separate modules—a special interrupt handler and a system management module—to handle saving different partitions—isolated and non-isolated—of the memory.
As in other suspend-and-resume systems, in the Garney system, the evolution of the computer system state is always moving forward in a linear trajectory. In other words, once the system is resumed, there is no way to go back to the previously suspended state. This is in part because the contents of the disk, which are not saved when the system enters the suspend state, may be freely modified after resuming—any post-resume modification prevents resuming again from the previously saved state. Thus, it is not possible to resume multiple times from a saved image. It is also not possible to save the state, continue execution, and then resume later from the saved state.
The Garney system also illustrates another common disadvantage of existing arrangements that provide for saving at least some part of the system state: It requires that software within the system itself must participate in saving the system state. Thus, in order to save the partial state in the Garney system, the additional system software needs to cause the processor to go into a system management interrupt state so that it can access a system management memory area. The processor must also be in the system management interrupt state in order to ensure that a critical part of the save routine will not be interrupted by a hardware interrupt.
Application/Process-Level Checkpointing
One known system for checkpointing applications is the “Condor” distributed processing system, which is described in “Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System,” M. Litzkow, et al., University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997; and “Supporting Checkpointing and Process Migration Outside the UNIX Kernel,” M. Litzkow, et al., Proceedings of the 1992 Winter USENIX Technical Conference, San Francisco, Calif., January 1992. The Condor system checkpoints the processes of running applications, and can migrate them to other machines as long as these also are running Condor. Only the application state is checkpointed, however, and the applications themselves must participate in the checkpointing by making calls to a checkpoint library.
All of the known systems and methods mentioned above suffer from one or more of the following disadvantages:
They save only part of the entire system state; as such, they cannot ensure complete restoration of the system state sufficient to guarantee that all applications will be able to continue exactly as they would have when the saved state is restored.
They are not able to generate checkpoints and save the state of the system at arbitrary points, or at multiple points. The systems will therefore not correctly save the partial state except when processing is interrupted at specific points or under specific conditions. This implies, of course, that there will be circumstances when the state cannot be saved at all. This means, in turn, that such systems cannot be used for such operations as full-state, step-by-step debugging of applications. In many cases, this limitation is caused by a need for synchronization of the partial state-saving procedure with applications, or a need to wait for some other internal process, such as a shutdown of some sub-system, to be completed before saving the partial state.
They require specialized system software such as special API's or operating systems. Alternatively, they assume and work only for particular operating systems and hardware architectures. They are therefore not beneficial to the most common users—those who need to run off-the-shelf applications using an off-the-shelf operating system. An additional consequence of this is that the checkpoints are not portable between different systems.
They need to flush disk caches.
What is needed is some way to overcome these disadvantages of the prior art, and in particular, to extract and restore the entire state of the computer system as a whole, not just of some portion of the memory. This then would enable complete restoration of the system to any point in its processing without requiring any application or operating system intervention, or any specialized or particular system software (such as API's and OS's) or hardware architecture. This invention provides a system and method that accomplish this, and it does so in a way that also makes possible other unique features, such as the ability for one or even multiple users to run, evaluate, test, restart, and duplicate a processing stream not only from the same point, but also from different points. The invention accomplishes this, moreover, in a manner that allows the entire state of the system to be checkpointed such that the state information is portable between different hardware platforms and system software configurations.