Deterministic replay in a virtual machine creates an execution that is logically equivalent to an original execution of interest. Two executions are logically equivalent if they contain the same set of dynamic instructions, each dynamic instruction computes the same result in the two executions, and the two executions compute the same final state of the system (processor, memory and devices). Virtual machines are software abstractions of physical computer systems, generally using virtualization software, which is typically a thin layer of software that logically sits, and provides an interface, between hardware and a guest Operating System (OS). Virtualization is well known to those in the field of computer science. Some virtualization functionality, however, has recently been implemented in hardware, including in recent microprocessor designs (as described further below) and in recent input/output (I/O) devices. Accordingly, the term “virtualization software” may be replaced by the term “virtualization logic” to encompass implementations involving any combination of software and/or hardware virtualization functionality. The term “virtualization software” will be primarily used throughout the following description, but this usage should not be understood as a limitation on the scope of the invention.
A virtual machine-based deterministic replayer may support full-system replay; i.e., the entire virtual machine (VM), including guest operating system (OS) and guest applications, is recorded and replayed. During recording, all sources of non-determinism from outside the virtual machine are captured and logged in a log file. These include data and timing of inputs to all devices, including virtual disks, virtual network interface cards (NICs), etc. A combination of techniques, such as device emulation and binary translation, is used to ensure deterministic replay as long as the recorded device input data are replayed at the right times.
Certain central processing unit (CPU) instructions are non-deterministic. A non-deterministic instruction is one whose output is not determined entirely by its inputs or a current architectural state. For example, the x86 RDTSC instruction returns the current time expressed in processor clocks, RDPMC returns the contents of performance counter registers, RDMSR returns the contents of model-specific registers (which may include performance counters), etc. Thus, the outputs of non-deterministic instructions can arise from the interaction of the VM with a non-deterministic unit such as a real-time clock, which is a device that can be queried by a CPU with an RDTSC instruction, whose result is returned in real time and depends on when the instruction is executed. Examples of other non-deterministic units include input devices (such as a keyboard, mouse, microphone, etc.), a thermal sensor, a transducer, a network card, a video camera, and so on. Such devices are non-deterministic because they produce inputs that cannot be predicted based solely on the state of the machine.
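The handling of non-deterministic instruction results during recording and replay can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of any particular virtualization product: the class NondetLog, its methods, and the function workload are hypothetical names, and Python's time.perf_counter_ns() merely stands in for a timestamp read such as RDTSC.

```python
import time
from collections import deque


class NondetLog:
    """Hypothetical log of results returned by non-deterministic reads."""

    def __init__(self):
        self.entries = deque()

    def record(self, read_fn):
        # Recording: perform the real read and append its result to the log.
        value = read_fn()
        self.entries.append(value)
        return value

    def replay(self):
        # Replaying: ignore the real source and return the logged result.
        return self.entries.popleft()


def workload(read_timestamp):
    # A computation whose result depends on two timestamp reads.
    t0 = read_timestamp()
    t1 = read_timestamp()
    return (t0, t1, t1 - t0)


log = NondetLog()
# Stand-in for RDTSC (assumption): a monotonic nanosecond counter.
recorded = workload(lambda: log.record(time.perf_counter_ns))
replayed = workload(log.replay)
```

Because the replayed run consumes the logged values instead of querying the clock, the two runs compute identical results even though the underlying counter has advanced.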
When executing application(s) within a VM, the virtualization software can record the complete execution behavior. Having saved this behavioral information, the user can replay that exact and complete behavior an unlimited number of times. This can be useful for debugging. For example, a user can record execution of the VM, and then attach a gdb debugger to the guest operating system or guest applications during replay. On replay, the user can look at memory, set breakpoints, and single step through the execution to identify problems and resolve them. Of course, record and replay have other applications too, any of which may benefit from the embodiments described herein.
Record and replay techniques may also be used to provide fault tolerance capabilities in a virtualized computer system, so that the virtualized computer system may continue to operate properly in the event of a failure in hardware, virtualization software or host software. One way of providing fault tolerance is to run two virtual machines (a “primary” virtual machine, and a “backup” or “secondary” virtual machine) in near lockstep.
FIG. 1 illustrates a method of providing fault tolerance by record and replay, using a backup VM supporting the primary VM. A primary VM 200-1 is the “real” VM that is actually communicating externally of the virtualized computer system. A backup VM 200-2 is configured to take over almost instantaneously if primary host 100-1 or primary VM 200-1 fails.
The primary VM 200-1 runs at least one VCPU 210-1 and the guest OS 220-1, supported by virtualization software, which may comprise a hypervisor 601-1 including a VMM (Virtual Machine Monitor) 300-1 and a VMkernel 600-1, on host system hardware 100-1 supporting a virtual disk 240-1. The secondary VM 200-2 also runs at least one VCPU 210-2 and the guest OS 220-2, also supported by virtualization software, which may comprise a hypervisor 601-2 including a VMM 300-2 and a VMkernel 600-2, on host system hardware 100-2 supporting a virtual disk 240-2. FIG. 1 shows a separate virtual disk 240-1, 240-2 for each VM 200-1, 200-2 for purposes of illustration; however, the primary VM 200-1 and secondary VM 200-2 in a fault tolerance configuration may share a common virtual disk, which may be managed and modified exclusively by the primary VM 200-1 until the secondary VM 200-2 takes over in the event of a failure of the primary VM 200-1. While the virtualized computer system illustrated in FIG. 1 includes virtualization software comprising a hypervisor, which further comprises a VMkernel and a VMM, this invention may be implemented in a wide variety of virtualized computer systems having a wide variety of configurations of virtualization software or virtualization logic, as described in the prior art, including, in particular, earlier-filed patents and patent applications assigned to VMware, Inc., the assignee of this patent application. For the purposes of this disclosure, any action performed by the VMkernels 600-1, 600-2 may be considered to be performed by virtualization software or virtualization logic in a broader sense, such as by the hypervisors 601-1, 601-2.
One way of keeping the two VMs 200-1, 200-2 in near lockstep for fault tolerance is to record (log) all non-deterministic inputs or events encountered by the primary VM 200-1 in log entries 280 and send the log entries 280 to the backup VM 200-2. The VMM 300-1 corresponding to the primary VM 200-1 records such logs and the VMkernel 600-1 sends the log entries 280 to the VMkernel 600-2 corresponding to the secondary VM 200-2. Non-deterministic inputs/events include, for example, (i) all inputs from the network external to the virtualized computer system, (ii) information regarding when virtual interrupts were delivered to the VCPU 210-1 due to external events, (iii) timer interrupts delivered to the VCPU 210-1, and (iv) timestamps delivered to the VCPU 210-1 when the VCPU 210-1 acquires the current time via various hardware functionality. The VMM 300-2 corresponding to the backup VM 200-2 then uses the log entries 280 to ensure that the backup VM 200-2 executes exactly the same instruction stream as the primary VM 200-1 (i.e., the backup VM 200-2 replays the log 280). The VMkernel 600-2 corresponding to the secondary VM 200-2 sends acknowledgments (ACK) 282 back to the VMkernel 600-1 corresponding to the primary VM 200-1, indicating which log entries 280 have been received at the secondary VM 200-2 and which log entries 280 have been replayed on the secondary VM 200-2.
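The exchange of log entries and acknowledgments described above can be sketched as follows. This is an illustrative model only: the types LogEntry, Ack and Backup, their fields, and the use of sequence numbers to express which entries have been received and which have been replayed are assumptions made for this sketch and are not taken from FIG. 1.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    seq: int         # position of the entry in the log (assumed numbering)
    kind: str        # e.g. "net_input", "timer_irq", "timestamp"
    payload: object  # recorded data for the event


@dataclass
class Ack:
    received_upto: int  # highest sequence number received by the backup
    replayed_upto: int  # highest sequence number replayed by the backup


@dataclass
class Backup:
    pending: list = field(default_factory=list)
    applied: list = field(default_factory=list)
    received_upto: int = -1
    replayed_upto: int = -1

    def receive(self, entry):
        # Queue the entry and acknowledge receipt.
        self.pending.append(entry)
        self.received_upto = entry.seq
        return Ack(self.received_upto, self.replayed_upto)

    def replay_one(self):
        # Apply the oldest pending entry and acknowledge the replay.
        entry = self.pending.pop(0)
        self.applied.append((entry.kind, entry.payload))
        self.replayed_upto = entry.seq
        return Ack(self.received_upto, self.replayed_upto)


backup = Backup()
events = [("net_input", b"ping"), ("timer_irq", 1000), ("timestamp", 123456)]
acks = [backup.receive(LogEntry(i, k, p)) for i, (k, p) in enumerate(events)]
acks.append(backup.replay_one())
```

In this sketch each acknowledgment carries both counters, so the primary side can tell how far the backup lags in replay as well as in receipt.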
Deterministic replay by the backup VM 200-2 requires that execution on the backup VM 200-2 during the replaying phase behave substantially the same as execution on the primary VM 200-1 during the logging phase. However, the memory management unit (MMU) of modern CPUs may be a source of non-determinism. In particular, MMUs of modern CPUs that include hardware support for processor (CPU) virtualization may be a source of non-determinism. Both Intel Corporation and Advanced Micro Devices, Inc. have introduced processor designs with hardware support for processor virtualization. Support in Intel processor designs is typically promoted as Intel Virtualization Technology (Intel VT-x) and was formerly known by the code-name “Vanderpool,” while support in AMD designs is typically promoted as AMD Virtualization (AMD-V) or Secure Virtual Machine (SVM) technology and was at one time known by the code-name “Pacifica.” Persons of ordinary skill in the art will generally be familiar with both AMD and Intel designs for hardware-assisted virtualization, which are detailed in published design documents such as Advanced Micro Devices, Inc., AMD64 Virtualization Codenamed “Pacifica” Technology: Secure Virtual Machine Architecture Reference Manual (2005) and Intel Corporation, Intel® Virtualization Technology Specification for the IA-32 Intel® Architecture (2005). Despite some apparent differences in terminology, persons of ordinary skill in the art will appreciate the substantial similarity of AMD and Intel hardware-assistance techniques. Among other possible uses, embodiments of this invention may be used to enable deterministic replay in a virtualized computer system having such modern CPUs. This patent describes the invention(s) in relation to these specific Intel and AMD processor designs, although the invention(s) may also be implemented in connection with other processor designs.
Thus, more specifically, the memory management unit (MMU) of modern CPUs (e.g., Intel VT-x or AMD-V CPUs) may use in-memory data structures (e.g., Nested Page Tables (NPT) or Extended Page Tables (EPT)) as well as on-the-chip data structures (e.g., Translation Lookaside Buffers (TLB)) for caching accessed entries of the in-memory data structures. The TLB may provide a source of non-determinism, as will be explained in more detail below.
FIG. 2A illustrates generally how a linear page number (LPN) 406 is translated to a machine page number (MPN) 410 by an MMU 450 in a modern CPU 110. LPN 406 is the virtual address page number used by guest OS 220 (and guest applications executing on the guest OS 220) to access virtual memory. LPN 406 is translated to a physical page number (PPN) 408, using guest page table 402 maintained by guest OS 220. The PPN 408 is a physical page number from the perspective of guest OS 220. However, in order to access the actual system memory, PPN 408 is generally translated to a machine page number (MPN) 410 in virtualized computer systems. Prior patents and applications assigned to VMware describe methods that may be used by virtualization software to translate guest “physical” addresses specified by a guest OS (e.g. PPN 408) to machine addresses (e.g. MPN 410) that can be used to access actual physical memory. These prior patents and applications describe “shadow page tables” generated by virtualization software and used by an MMU to translate guest virtual addresses (e.g. LPN 406) into machine addresses (e.g. MPN 410). In some modern CPUs 110, however, the MMU 450 can translate the LPN 406 to an MPN 410 using guest page table 402 along with NPT or EPT 404. NPT or EPT 404 is typically maintained by virtualization software, such as VMM 300. As described in existing literature and as known in the art, the MMU 450 may retain a limited number of various mappings, including mappings from LPN 406 to PPN 408 and mappings from LPN 406 to MPN 410, among others, in a TLB 454 and in paging structure caches 456, to improve memory access times. In general terms, when translating an LPN 406 to an MPN 410, MMU 450 typically first looks in TLB 454 for the required mapping. If a valid mapping from LPN 406 to MPN 410 is found, the cached mapping is generally used, and the MMU 450 generally does not need to use the guest page table 402 or the NPT or EPT 404 to determine the appropriate translation. If a valid mapping from LPN 406 to MPN 410 is not found, however, the MMU 450 must generally perform a page table walk to determine the translation. Such a page table walk is described below in connection with FIGS. 2B and 2C.
Prior patents and applications assigned to VMware have used the terms GVPN (Guest Virtual Page Number), GPPN (Guest Physical Page Number) and PPN (Physical Page Number) in describing address translations in virtualized computer systems. LPN, as used in this patent, is analogous to GVPN, as used in some prior VMware patents; PPN, as used in this patent, is analogous to GPPN, as used in some prior VMware patents; and MPN, as used in this patent, is analogous to PPN, as used in some prior VMware patents.
FIG. 2B illustrates in greater detail how the MMU 450 performs a page table walk on the guest page table 402, according to one configuration, to translate from LPN 406 to PPN 408, and further uses NPT or EPT 404 to translate from PPN 408 to MPN 410. For purposes of this patent, a translation from LPN 406 to PPN 408 will be referred to as a “guest translation,” while a translation from PPN 408 to MPN 410 will be referred to as a “host translation.” Although the terminology used by Intel for guest page tables and EPT and the terminology used by AMD for guest page tables and NPT differ, the structure and use of these page tables are substantially similar, and, although the following description uses terminology from Intel literature, a person of skill in the art will also understand the structure and process as they relate to AMD CPUs, as well as other possible hardware-assist CPUs. FIG. 2B shows a 3-level structure for guest page table 402, although structures having different numbers of levels are also possible. A person of skill in the art will understand other possible structures and their use, based on existing literature, including, in particular, relevant literature from Intel and AMD. Thus, guest page table 402 comprises a page directory 402-1, a page table 402-3 and a page frame 402-5. Actually, as is well known, virtualized computer systems typically comprise numerous guest page tables 402, each with its own page directory 402-1, and each guest page table 402 typically comprises a plurality of page tables 402-3 and a plurality of page frames 402-5; however, for simplicity, FIG. 2B shows only the page directory, page table and page frame involved in a current address translation. The page directories 402-1 and the page tables 402-3 are referred to collectively herein as “guest page table pages,” while the page frames 402-5 are referred to herein as “guest data pages.”
As also shown in FIG. 2B, linear address 406A comprises a directory value 406-1, a table value 406-2 and an offset value 406-3. LPN 406 comprises the directory value 406-1 and the table value 406-2. Along with guest page table 402, control register CR3 412 is also maintained by guest OS 220. CR3 412 specifies a base address for page directory 402-1 in the form of a PPN (or in the form of a physical address, depending on the paging mode). MMU 450 performs a host translation 409-1 to translate this PPN into MPN 410-1 using NPT/EPT 404. The structure of NPT/EPT 404 and the process for its use in translating from PPN to MPN is described below in connection with FIG. 2C. MPN 410-1 specifies the base address of page directory 402-1 in terms of a machine address. The directory value 406-1 is then used as an index into page directory 402-1 to select page directory entry 402-2. Entry 402-2 specifies the base address for page table 402-3 again in the form of a PPN. MMU 450 performs another host translation 409-2 to translate this PPN into MPN 410-2 using NPT/EPT 404. MPN 410-2 specifies the base address of page table 402-3 in terms of a machine address. The table value 406-2 is then used as an index into page table 402-3 to select page table entry 402-4. Entry 402-4 specifies the base address for page frame 402-5 again in the form of a PPN. MMU 450 performs another host translation 409-3 to translate this PPN into MPN 410-3 using NPT/EPT 404. MPN 410-3 specifies the base address of page frame 402-5 in terms of a machine address. Page frame 402-5 includes the memory location for the memory access. The actual machine address (MA) 402-6 for the memory access is determined by adding the offset 406-3 to MPN 410-3.
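The walk described above, in which each guest page table pointer is itself passed through a host translation, can be sketched as follows. This is a simplified model under stated assumptions: the field widths (10-bit directory value, 10-bit table value, 12-bit offset) are assumed for illustration, the dictionary host_map stands in for the full NPT/EPT lookup, and the names split_linear, guest_walk, host_map and machine_mem are hypothetical.

```python
PAGE_SHIFT = 12   # assumed 4 KiB pages (12-bit offset value)
INDEX_BITS = 10   # assumed 10-bit directory and table values


def split_linear(la):
    """Split a linear address into (directory, table, offset) values."""
    offset = la & ((1 << PAGE_SHIFT) - 1)
    table = (la >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    directory = (la >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return directory, table, offset


def guest_walk(la, cr3_ppn, host_map, machine_mem):
    """Translate linear address `la` to a machine address.

    host_map:    dict PPN -> MPN, a stand-in for the NPT/EPT lookup.
    machine_mem: dict (MPN, index) -> PPN held in that page table entry.
    """
    directory, table, offset = split_linear(la)
    pd_mpn = host_map[cr3_ppn]                 # host translation 409-1
    pt_ppn = machine_mem[(pd_mpn, directory)]  # page directory entry 402-2
    pt_mpn = host_map[pt_ppn]                  # host translation 409-2
    frame_ppn = machine_mem[(pt_mpn, table)]   # page table entry 402-4
    frame_mpn = host_map[frame_ppn]            # host translation 409-3
    # Combine the page frame's machine base with the offset.
    return (frame_mpn << PAGE_SHIFT) | offset


host_map = {0x10: 0x90, 0x11: 0x91, 0x12: 0x92}
machine_mem = {(0x90, 1): 0x11, (0x91, 2): 0x12}
la = (1 << 22) | (2 << 12) | 0x34
ma = guest_walk(la, cr3_ppn=0x10, host_map=host_map, machine_mem=machine_mem)
```

Note that a single guest translation in this model requires three host translations, one for each PPN encountered along the walk.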
FIG. 2C illustrates in greater detail how the MMU 450 uses the NPT or EPT 404, according to one configuration, to perform a host translation, such as the host translations 409-1, 409-2 and 409-3, translating from PPN 408 to MPN 410. FIG. 2C shows a 4-level structure for NPT/EPT 404, although structures having different numbers of levels are also possible. A person of skill in the art will understand other possible structures and their use, based on existing literature, including, in particular, relevant literature from Intel and AMD. Thus, NPT/EPT 404 comprises a PML4 table 404-1, a page directory pointer table 404-3, a page directory 404-5 and a page table 404-7. At the same time, guest physical address 408A comprises an upper portion 408-1 that is not used for translating to MPN, a PML4 value 408-2, a directory pointer value 408-3, a directory value 408-4, a table value 408-5 and an offset value 408-6. PPN 408 comprises the PML4 value 408-2, the directory pointer value 408-3, the directory value 408-4 and the table value 408-5. Along with NPT/EPT 404, Virtual Machine Control Structure (VMCS) 413 is also maintained by VMM 300. VMCS 413 includes EPT PTR 413-1, which specifies a machine base address for PML4 table 404-1. The PML4 value 408-2 is then used as an index into PML4 table 404-1 to select PML4 table entry 404-2. The entry 404-2 specifies a machine base address for page directory pointer table 404-3. The directory pointer value 408-3 is then used as an index into page directory pointer table 404-3 to select page directory pointer table entry 404-4. Entry 404-4 specifies the machine base address for page directory 404-5. The directory value 408-4 is then used as an index into page directory 404-5 to select page directory entry 404-6. The entry 404-6 specifies a machine base address for page table 404-7. The table value 408-5 is then used as an index into page table 404-7 to select page table entry 404-8. Entry 404-8 specifies MPN 410 corresponding to PPN 408.
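The four-level host translation described above can be sketched as follows. The field widths are an assumption chosen to match the common x86-64 layout (four 9-bit index values and a 12-bit offset); the text above does not specify them. The names split_guest_physical, host_translate and tables are hypothetical, and each table is modeled as a dictionary keyed by its machine base address.

```python
PAGE_SHIFT = 12                      # assumed 12-bit offset value
LEVEL_BITS = 9                       # assumed 9-bit index per level
LEVEL_MASK = (1 << LEVEL_BITS) - 1


def split_guest_physical(gpa):
    """Return (pml4, dir_ptr, directory, table, offset) values."""
    offset = gpa & ((1 << PAGE_SHIFT) - 1)
    table = (gpa >> 12) & LEVEL_MASK
    directory = (gpa >> 21) & LEVEL_MASK
    dir_ptr = (gpa >> 30) & LEVEL_MASK
    pml4 = (gpa >> 39) & LEVEL_MASK
    return pml4, dir_ptr, directory, table, offset


def host_translate(gpa, ept_ptr, tables):
    """Walk four levels starting at the base given by `ept_ptr`.

    tables: dict mapping a table's machine base to {index: entry value}.
    Returns (MPN, offset).
    """
    pml4, dir_ptr, directory, table, offset = split_guest_physical(gpa)
    base = ept_ptr
    for index in (pml4, dir_ptr, directory, table):
        # Each entry gives the machine base of the next level; the entry
        # selected at the last level gives the MPN itself.
        base = tables[base][index]
    return base, offset


tables = {
    0x1000: {1: 0x2000},   # PML4 table 404-1
    0x2000: {2: 0x3000},   # page directory pointer table 404-3
    0x3000: {3: 0x4000},   # page directory 404-5
    0x4000: {4: 0xABC},    # page table 404-7
}
gpa = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x56
mpn, offset = host_translate(gpa, ept_ptr=0x1000, tables=tables)
```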
As mentioned above, MMU 450 generally stores recently used mappings related to guest page table 402 and NPT/EPT 404 in TLB 454 and in paging structure caches 456 to speed up subsequent memory accesses. Different types of mappings may be stored in TLB 454 and caches 456, as described in existing literature, including, in particular, relevant literature from Intel and AMD. For example, recent mappings from LPN 406 to PPN 408 and recent mappings from LPN 406 to MPN 410 may be stored in TLB 454, while recent mappings from PPN 408 to MPN 410 and recent mappings from PML4 value 408-2 to the contents of PML4 table entry 404-2 may be stored in caches 456. If the guest OS 220 makes changes to guest page table 402 and/or the VMM 300 makes changes to NPT/EPT 404, one or more of the mappings stored in TLB 454 and caches 456 may become stale relative to the corresponding mappings in guest page table 402 and/or NPT/EPT 404. Inconsistencies can arise between the mappings in guest page table 402 and NPT/EPT 404 on one hand and the cached mappings in TLB 454 and caches 456 on the other hand. Such inconsistencies can give rise to non-determinism. Software generally cannot determine exactly which mappings are stored in TLB 454 and caches 456 because mappings can be stored and/or evicted by unpredictable means. For example, mappings can be evicted from TLB 454 due to capacity evictions, System Management Interrupts and speculative TLB prefetches. Thus, the mapping used for a memory access can depend on whether a particular mapping has been stored in or evicted from TLB 454 or caches 456. Suppose, for example, that MMU 450 stores a mapping from a first LPN to a first MPN in TLB 454, based on the mappings in guest page table 402 and NPT/EPT 404. Next, suppose that guest OS 220 changes guest page table 402, so that the first LPN should now map to a second MPN. Suppose next that there is a memory access to the first LPN before any TLB flush (or relevant TLB invalidation). The mapping used by MMU 450 for this memory access to the first LPN will depend on whether or not the mapping from the first LPN to the first MPN has been evicted from TLB 454. If the mapping has not been evicted, then MMU 450 will generally map the first LPN to the first MPN based on the cached mapping, while, if the mapping has been evicted, the MMU 450 will walk the guest page table 402 and determine that the first LPN should map to the second MPN.
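The scenario above can be modeled in a few lines. This is a toy sketch, not a model of any real MMU: the TLB and the page table are plain dictionaries, the page table maps an LPN directly to an MPN, and the names translate and run, as well as the page numbers used, are hypothetical.

```python
def translate(lpn, tlb, page_table):
    """Use the cached mapping if present; otherwise walk and cache."""
    if lpn in tlb:
        return tlb[lpn]
    mpn = page_table[lpn]
    tlb[lpn] = mpn
    return mpn


def run(evicted_before_access):
    page_table = {0x5: 0xA}          # first LPN -> first MPN
    tlb = {}
    translate(0x5, tlb, page_table)  # the mapping is now cached
    page_table[0x5] = 0xB            # guest remaps to second MPN, no flush
    if evicted_before_access:
        # Eviction happens outside software control, e.g. for capacity.
        tlb.pop(0x5)
    return translate(0x5, tlb, page_table)


stale = run(evicted_before_access=False)
fresh = run(evicted_before_access=True)
```

The same sequence of guest operations yields the first MPN in one run and the second MPN in the other, depending solely on an eviction that software can neither observe nor control.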
The VMM 300 can eliminate any possible non-determinism resulting from changes it makes to NPT/EPT 404 by flushing the TLB 454 and/or the caches 456, and/or by invalidating entries in the TLB 454 and/or the caches 456. A person of skill in the art will understand how to do this, based on existing literature, including, in particular, relevant literature from Intel and AMD.
In some existing virtualized computer systems, the VMM 300 can also eliminate any possible non-determinism resulting from changes the guest OS 220 makes to the guest page table 402. In existing virtualization products from VMware, for example, the VMM 300 can place traces on all physical memory pages that constitute the guest page table 402. Traces are described in earlier patents owned by VMware. If the guest OS 220 attempts to write to the guest page table 402, the VMM 300 is activated and alerted to the attempted write. The VMM 300 may allow the attempted write to take place, but then the VMM 300 can also eliminate any possible non-determinism by flushing the TLB 454 and/or the caches 456, and/or by invalidating appropriate entries in the TLB 454 and/or the caches 456.
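The effect of placing a trace on the guest page table can be sketched as follows. This is a toy model under stated assumptions, not a description of how traces are implemented in any product: the class TracedPageTable and the function translate are hypothetical, the TLB is a plain dictionary, and the trace is modeled simply as a write method that invalidates the affected cached entry after the write is allowed.

```python
class TracedPageTable:
    """Hypothetical page table whose writes trigger a TLB invalidation."""

    def __init__(self, mappings, tlb):
        self._mappings = dict(mappings)
        self._tlb = tlb

    def read(self, lpn):
        return self._mappings[lpn]

    def write(self, lpn, mpn):
        # The trace fires: the write is allowed to take place, then the
        # cached entry for this LPN is invalidated so that no stale
        # mapping can be used by a later access.
        self._mappings[lpn] = mpn
        self._tlb.pop(lpn, None)


def translate(lpn, tlb, page_table):
    """Use the cached mapping if present; otherwise walk and cache."""
    if lpn not in tlb:
        tlb[lpn] = page_table.read(lpn)
    return tlb[lpn]


tlb = {}
table = TracedPageTable({0x5: 0xA}, tlb)
before = translate(0x5, tlb, table)
table.write(0x5, 0xB)
after = translate(0x5, tlb, table)
```

Because every write is followed by an invalidation, the result of the second access no longer depends on whether the hardware happened to evict the old entry.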
However, to fully take advantage of efficiencies of the modern CPUs described above, the VMM 300 preferably does not place traces on the physical memory pages containing the guest page table 402. Instead, the VMM 300 should allow the guest OS 220 to write to the guest page table 402, without any such traces. In this case, however, the VMM 300 generally cannot eliminate all possible non-determinism resulting from changes to the guest page table 402 by the guest OS 220.