Capturing the cause(s) of system failures, i.e., crashes, is an important feature in modern operating systems. In the Windows family of operating systems, e.g., Windows NT, Windows 2000, Windows XP, crashdump is implemented via a complex interaction between the Windows kernel (ntoskrnl) and some lower level storage drivers. Some background relating to Windows storage, with emphasis on how crashdump interacts with the storage stack, is shown in FIGS. 1A and 1B. FIG. 1A shows a diagram of the Windows storage stack during normal operation, and FIG. 1B shows exemplary aspects of bypassing the Windows storage stack when system failure occurs.
Describing the actions of each layer in the stack of FIG. 1A, NTOSKRNL 200, a.k.a. the kernel, is responsible for determining whether an I/O request needs to be generated for a given operation, generating the request (IRP), and marshaling buffers, if necessary. The file-system 205 imposes file structure on the raw disk. The volume shadow copy 210 provides for lazy volume snapshots, which are used for live backup and rollback of files. The volume manager 215 presents user-level volumes, e.g., “C:”, “D:” etc. This is the bottom of the volume stack. The volume manager 215 may also provide redundancy or striping capabilities. Multiple disks may therefore be aggregated by the volume layer into a single volume.
The partition manager 220 is the top of the device stack. The partition manager 220 has a private interface to volume manager 215, notifying the volume manager 215 when partitions come and go. The partition manager layer allows multiple disk-partitions to be exposed from a single disk drive. The term “disk-partition” is being used here in a nonstandard way to avoid confusion between partitions of a disk drive and partitions within a hypervisor.
The disk class driver 225 translates IRP-based commands into SCSI commands, for instance, using the SCSI_REQUEST_BLOCK data structure. The disk class driver 225 also manages any disk-specific aspects of the storage request. The port layer 230 manages a specific controller or adapter 240, which in turn interfaces with hard disk 245. For example, ScsiPort manages a SCSI controller (adapter), ATAPORT manages an IDE controller and SBP2PORT manages a 1394 controller. The port layer 230 also translates commands from the SCSI command set to a non-SCSI command set for devices that do not support the SCSI command set (such as IDE).
Miniport 235 is a vendor supplied layer that works in conjunction with a port driver 230 to access the controller hardware. Many types of storage controllers do not have standardized hardware interfaces, and therefore require vendor-supplied miniports 235 to program the hardware. Controllers that do have standardized hardware interfaces do not generally require miniports 235.
Ordinarily, during the writing of a crashdump, Windows bypasses most of these components and writes data directly to the port driver 230. This allows the operating system to successfully generate a crashdump in the presence of failures in the higher layers of the stack (such as the file-system 205). The crash dump stack is shown in FIG. 1B, illustrating exemplary operation of the Windows dump stack during a crashdump.
During a crashdump, the kernel NTOSKRNL 200 acts as the top five layers in the storage stack, bypassing the file-system 205, volume shadow copy 210, volume manager 215, partition manager 220 and disk class layers 225. The kernel 200 communicates directly with a special purpose dump port driver 250 using a private, synchronous interface. The dump port driver 250 implements this interface, and it also implements the miniport interface, if necessary.
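The relationship between the two stacks can be illustrated with a small sketch. The layer names follow FIGS. 1A and 1B; representing each stack as an ordered list, and the helper name `bypassed_layers`, are assumptions made purely for illustration and are not part of any Windows interface.

```python
# Illustrative model only: each stack is an ordered list of layer names,
# top of the stack first. The names follow FIGS. 1A and 1B.
NORMAL_STACK = [
    "kernel 200",
    "file-system 205",
    "volume shadow copy 210",
    "volume manager 215",
    "partition manager 220",
    "disk class driver 225",
    "port driver 230",
    "miniport 235",
    "adapter 240",
    "hard disk 245",
]

DUMP_STACK = [
    "kernel 200",
    "dump port driver 250",
    "miniport 235",
    "adapter 240",
    "hard disk 245",
]


def bypassed_layers(normal, dump):
    """Return the layers of the normal stack that the dump stack does not use."""
    dump_layers = set(dump)
    return [layer for layer in normal if layer not in dump_layers]
```

In this model the result contains the layers from the file-system 205 through the disk class driver 225, plus the mainline port driver 230, whose place is taken by the dump port driver 250.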
The miniport 235, if present, fulfills the same role in the dump stack as in the regular storage stack. Specifically, the miniport 235 provides a mechanism to submit commands to the storage adapter 240. In the ScsiPort and StorPort architectures, the miniport 235 may distinguish between normal and dump operation by scanning for the “dump=1” string in the parameters passed into the HwFindAdapter routine of miniport 235.
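The “dump=1” check can be sketched as a simple substring scan. This is a Python illustration of the logic only, not driver code: the function name `is_dump_mode` is hypothetical, and the exact format of the parameter string is not specified here, so a plain substring test is assumed.

```python
def is_dump_mode(argument_string):
    """Return True if the parameter string carries the "dump=1" marker.

    Assumption: the marker appears verbatim somewhere in the string.
    A missing or empty string is treated as normal (non-dump) operation.
    """
    if not argument_string:
        return False
    return "dump=1" in argument_string
```

A miniport following this pattern would select its restricted crash-time code paths when the check succeeds and its normal code paths otherwise.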
The Windows dump stack as shown in FIG. 1B has a number of notable implications. Since the file-system 205 is not present during the crash, the kernel 200 needs to maintain enough file-system information to write to the dump file without it. This is generally done by obtaining the raw sector offsets of the dump file on the disk ahead of time, using the file-system control FSCTL_QUERY_RETRIEVAL_POINTERS, and then writing directly to those sectors at crash time. The file-system 205 also may not modify the dump file in any manner after the file has been prepared to accept a crashdump. Such a file is said to be “locked” in the sense that the sectors comprising the file may not be moved, e.g., for defragmentation.
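The idea of precomputing raw sector offsets can be sketched as follows. The sketch assumes the dump file's on-disk layout has already been captured as a list of contiguous runs; the `Extent` record, the 512-byte sector size, and the function name are illustrative assumptions and do not reflect the actual output format of FSCTL_QUERY_RETRIEVAL_POINTERS.

```python
from collections import namedtuple

# Hypothetical record for one contiguous run of the dump file on disk:
#   file_offset: byte offset within the file where the run begins
#   disk_offset: byte offset on the raw disk where the run begins
#   length:      length of the run in bytes
Extent = namedtuple("Extent", ["file_offset", "disk_offset", "length"])

SECTOR_SIZE = 512  # assumed sector size in bytes


def file_offset_to_sector(extents, file_offset):
    """Translate a sector-aligned file offset into an absolute disk sector."""
    if file_offset % SECTOR_SIZE != 0:
        raise ValueError("file offset must be sector-aligned")
    for ext in extents:
        if ext.file_offset <= file_offset < ext.file_offset + ext.length:
            disk_byte = ext.disk_offset + (file_offset - ext.file_offset)
            return disk_byte // SECTOR_SIZE
    raise ValueError("file offset lies outside the dump file")
```

Because a table of this kind is captured before any crash occurs, it remains valid only if the sectors of the file never move, which is why the file must stay locked.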
The volume management layer 215 is responsible for providing software-based redundancy and virtualization. For example, Microsoft's software RAID and striping implementations are implemented in the volume manager layer 215. Because the dump stack bypasses the volume management layer 215, the dump cannot take place to a volume that has redundancy or striping implemented atop of it.
The partition manager 220 and disk layers 225 manage the partition table for the device. The partition table specifies where a partition begins on a disk and the partition's size. The partition table on the volume that the dump is intended for therefore may not be modified.
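To see why the partition table must stay fixed, consider a sketch in which the partition's starting sector and size are captured once, before any crash, and later used to turn partition-relative sectors into absolute disk sectors. The `PartitionEntry` record and the function name are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical snapshot of one partition table entry, in units of sectors.
PartitionEntry = namedtuple("PartitionEntry", ["start_sector", "sector_count"])


def to_disk_sector(partition, relative_sector, count=1):
    """Map a partition-relative sector range to its first absolute disk sector.

    Raises ValueError if the range would fall outside the partition.
    """
    if relative_sector < 0 or count < 1:
        raise ValueError("sector and count must be non-negative and non-empty")
    if relative_sector + count > partition.sector_count:
        raise ValueError("range runs past the end of the partition")
    return partition.start_sector + relative_sector
```

If the partition were moved or resized after the snapshot was taken, the computed sectors would point at the wrong location on the disk.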
Since a system crash may occur at any time, a crash may occur when locks are held or when at a raised IRQL. Therefore, the dump port driver 250 may not acquire locks, allocate memory, wait for resources, access paged data, etc. This limited environment is the reason that the mainline port driver 230 is not used to perform the crashdump. The mainline port driver 230 generally manages locks and other resources which are not feasible tasks to perform at crash time.
The miniport 235 (if present) is subject to the same restrictions as the dump port driver 250. Luckily, some miniport designs 235 do not expose such high-level primitives as locks and IRQL to miniport authors, so these issues are easily virtualized.
By way of further background, Windows supports three types of crashdumps: a full memory dump, a kernel memory dump and a minidump. A full memory dump dumps the entire physical memory of the machine. The kernel memory dump dumps only that portion of the address space devoted to kernel-memory. The minidump is a very small dump (generally 64 KB in size) that captures the minimal information necessary to triage and perform minimal debugging of the failure.
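The three dump types can be compared with a rough size estimate. The enumeration and function names below are hypothetical, and the estimate covers only the captured memory, ignoring headers and any other per-dump overhead.

```python
from enum import Enum


class DumpType(Enum):
    FULL = "full memory dump"
    KERNEL = "kernel memory dump"
    MINI = "minidump"


MINIDUMP_BYTES = 64 * 1024  # "generally 64 KB", per the description above


def approximate_dump_bytes(dump_type, physical_bytes, kernel_bytes):
    """Rough payload size of a dump, ignoring headers and other overhead."""
    if dump_type is DumpType.FULL:
        return physical_bytes   # entire physical memory of the machine
    if dump_type is DumpType.KERNEL:
        return kernel_bytes     # only the kernel-memory portion
    if dump_type is DumpType.MINI:
        return MINIDUMP_BYTES   # minimal triage information
    raise ValueError("unknown dump type")
```

For a hypothetical machine with 4 GB of physical memory of which 512 MB is kernel memory, the three estimates are 4 GB, 512 MB and 64 KB respectively.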
In a typical virtual machine environment, multiple virtual machines or “partitions” run on top of virtualizing software. This software, in turn, runs on top of hardware. The virtualizing software exposes the hardware in such a fashion that allows for a plurality of partitions, each with its own operating system (OS), to run on the hardware. The hardware is thus virtualized for the partitions by the virtualizing software.
Individual partitions are able to run disparate OSes, such as Windows, Linux, Solaris, MacOS and so on. These OSes can be isolated from each other such that if one OS in a partition crashes it will not affect other OSes in other partitions. Additionally, allowing multiple OSes to run on a single piece of hardware but in different partitions makes it easy to run different versions of software developed for different versions or types of OSes.
In a virtual environment, for instance a hypervisor/VMM environment, crashdump presents several additional problems and situations.
The terms hypervisor and virtual machine manager (VMM) are used herein interchangeably, whether utilized in conjunction with or part of a host operating system or not; and the terms virtual machine and partition are also used interchangeably, i.e., where the term partition is used, this should be considered the same as the term virtual machine.
Frequently in a hypervisor or VMM environment, the hypervisor component will not have direct access to a physical storage device. In such an environment, it will not generally be possible for the hypervisor to generate a crashdump file because it does not have access to a storage device. Thus, a first problem for failure management in a virtual environment is that the hypervisor does not have access to storage to write a crashdump file.
In a secure environment, the principal goal is to ensure that secrets are never exposed. Assuming the first problem above is solved and a crashdump can be generated for a machine, it may be that secret data that was private to a virtual machine is exposed through the crashdump. Thus, a second problem for failure management in a virtual environment is that secrets may be exposed via a crashdump.
When generating crashdumps, a goal is to minimize the amount of data that is saved to the dump. Minimizing the amount of dump data serves two purposes. First, it reduces the size of the dump, and associated storage space that the dump consumes. Additionally, when the size of the crashdump is reduced, the speed to generate the crashdump is increased. Thus, a third problem for failure management in a virtual environment is that it is desired to reduce the amount of storage for a dump, and reduce the time to generate a dump.
The current Windows crashdump architecture has several other limitations as well. For instance, a crashdump may be generated only to the boot drive, badly corrupted machines will not generate crashdumps, and the crashdump code requires specific storage drivers to correctly operate. Thus, a fourth problem for failure management in a virtual environment is the avoidance of limitations in the current crashdump architecture. How these and other problems are addressed by the invention is described in the various following sections.