Virtual machine technology allows multiple virtual machines, each having its own operating system, to run on a single physical machine. The virtual machine is called a guest with respect to the host computer. The host computer has virtualization software that emulates the processor architecture and the hardware resources required by the software running inside a virtual machine. The virtualization software emulates the instructions executed by the guest and virtualizes the hardware resources of the host machine for use by the guest.
With virtual machine technology, the guest computer system exists on the host computer system as a pure software representation of the operation of the hardware architecture of the virtual machine. The virtualization software executing on the hardware architecture of the host computer mimics the operation of the entire guest computer system. The virtualization software acts as the interface between the hardware architecture and resources of the physical machine and the instructions executed by the software (e.g., operating systems, applications, etc.) running in the virtual machine environment. In one embodiment of virtual machine technology, the virtualized environment is created by a Virtual Machine Monitor (VMM), which is a software layer that runs directly on top of the host hardware but below the guest software and may run side-by-side and work in conjunction with a host operating system. The VMM can virtualize the resources of the host machine by exposing interfaces that match the guest's hardware architecture and by intercepting and virtualizing operations performed by the guest. This virtualization enables the virtualization software (and the host computer system) to go unnoticed by the guest operating system running on top of it, although this virtualization does incur a performance penalty as it consumes host resources (e.g., the processor). In addition, it is desirable to keep one virtual machine separated from other virtual machines, as well as from the host. Separation of one virtual machine from another is useful to isolate faults such that a fault in one virtual machine does not affect another virtual machine.
Computer operating systems typically provide isolation between different applications so that when one application fails, others are not affected. One technique is to execute each application in a separate process with its own address space. With this mechanism, each process is presented with virtual memory addresses that it can read from or write to, and the operating system takes care of backing the pages of virtual memory used by the application with pages of physical memory and maintaining a map of virtual addresses (VAs) to physical addresses (PAs). Since a process can only access memory via virtual addresses, the operating system is able to restrict the pages of physical memory that a process may access. When a program accesses a virtual memory address, the processor translates the virtual address into a physical address by consulting the VA-to-PA map specified by the operating system.
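By way of illustration only, the per-process mapping described above may be sketched as follows. The class `AddressSpace`, the exception `PageFault`, and the page numbers used are hypothetical and chosen for the example; a real operating system keeps this map in hardware-defined page tables rather than in a dictionary.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when a virtual page has no physical page backing it."""


class AddressSpace:
    """Hypothetical per-process VA-to-PA map maintained by the operating system."""

    def __init__(self):
        self.page_map = {}  # virtual page number -> physical page number

    def map_page(self, vpn, ppn):
        self.page_map[vpn] = ppn

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn not in self.page_map:
            raise PageFault(f"no mapping for VA {va:#x}")
        return self.page_map[vpn] * PAGE_SIZE + offset


# Two processes use the same virtual address but are backed by
# different physical pages, so neither can reach the other's memory.
proc_a = AddressSpace()
proc_b = AddressSpace()
proc_a.map_page(0x10, 0x200)
proc_b.map_page(0x10, 0x300)

pa_a = proc_a.translate(0x10123)
pa_b = proc_b.translate(0x10123)
```

Because each process can name memory only through its own map, the operating system controls which physical pages the process can reach simply by controlling the entries in that map.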
One processor architecture that may be used in the host computer system is the x86 family of processors. Modern x86 processors have a built-in memory address map lookup mechanism that efficiently converts a virtual address to a physical address via a procedure known as a page table walk. Page tables specify a VA-to-PA map, which the operating system maintains for each virtual address space. FIG. 1 shows a typical prior-art page table walk process, implemented in hardware in an x86 processor.
An x86 processor with paging enabled relies on a set of page directory tables (PDTs) that point to page tables (PTs) to map virtual addresses to physical addresses. Most modern x86 processors also support Physical Address Extension (PAE), which increases the amount of physical memory the machine can support. With PAE enabled, a third-level address translation lookup table called the page directory pointer table (PDPT) is also used. Finally, x86 processors with 64-bit extensions require a fourth-level address translation lookup table called a page map level 4 table (PML4T). In all of these variations, the approach is the same: each upper-level page table references one or more page tables for the next level. Each tree of page tables specifies a sparse map of the virtual address space.
FIG. 1 represents a standard x86 scheme 100 for resolving a virtual address 190 to a physical address. A processor control register 110 specifies the physical address 115 of the page directory pointer table (PDPT). In the example of an x86 processor, the control register is known as CR3. The top bits 192 of the virtual address provide an index 125 into the PDPT to select from a multiplicity of page directory tables (PDTs) 130, 140. In the FIG. 1 example, the VA selects the PDT 140, which points to a multiplicity of page tables 160, 170. The middle bits 194 in the VA index 145 into the PDT to select a page table 170. A specific entry 180 in the selected page table 170 is selected by another field of bits 196 in the VA. The entry 180 provides the physical address of the page corresponding to the virtual address 190 being resolved, with the bottom bits 198 specifying the offset into the page. In summary, the x86 processor resolves a virtual address by walking down the page table tree starting at the top-level page table specified in the control register (CR3) and indexing into each page table using parts of the VA. The selected entry in the leaf page table stores the related physical address.
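The walk of FIG. 1 may be sketched in software as shown below, assuming the PAE layout for a 32-bit virtual address with 4 KiB pages (2 bits of PDPT index, 9 bits of PDT index, 9 bits of page table index, and a 12-bit offset). The names `split_va`, `lookup`, and `walk`, and the modeling of physical memory as a dictionary of tables, are assumptions made for the example. Real table entries also carry present and permission bits, which are omitted here.

```python
class PageFault(Exception):
    """Raised when a table entry needed by the walk is not present."""


def split_va(va):
    """Split a 32-bit VA into the PAE index fields and the page offset."""
    pdpt_index = (va >> 30) & 0x3    # top 2 bits select a PDT
    pdt_index = (va >> 21) & 0x1FF   # next 9 bits select a page table
    pt_index = (va >> 12) & 0x1FF    # next 9 bits select a page table entry
    offset = va & 0xFFF              # bottom 12 bits: offset into the page
    return pdpt_index, pdt_index, pt_index, offset


def lookup(memory, table_base, index):
    """Read one entry of the table located at physical address table_base."""
    entry = memory.get(table_base, {}).get(index)
    if entry is None:
        raise PageFault(f"no entry {index} in table at {table_base:#x}")
    return entry


def walk(memory, cr3, va):
    """Resolve va by walking PDPT -> PDT -> PT, starting at the base in CR3."""
    pdpt_index, pdt_index, pt_index, offset = split_va(va)
    pdt_base = lookup(memory, cr3, pdpt_index)
    pt_base = lookup(memory, pdt_base, pdt_index)
    page_base = lookup(memory, pt_base, pt_index)
    return page_base + offset


# Simplified physical memory: table base address -> {index: entry}.
memory = {
    0x1000: {1: 0x2000},  # PDPT: entry 1 points to the PDT at 0x2000
    0x2000: {2: 0x3000},  # PDT: entry 2 points to the page table at 0x3000
    0x3000: {3: 0x7000},  # PT: entry 3 maps to the physical page at 0x7000
}

va = (1 << 30) | (2 << 21) | (3 << 12) | 0x45
pa = walk(memory, cr3=0x1000, va=va)
```

Each level consumes one field of the virtual address, so the cost of the walk grows with the depth of the tree. The tree is also sparse: only the tables that are actually referenced need to exist.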
Modern x86 processors also feature a translation look-aside buffer (TLB) in which they cache the most recently traversed page table mappings by storing pairs of virtual addresses and physical addresses. Upon a memory access, the processor checks whether the desired VA-to-PA translation is already cached in the TLB. If it is, the TLB hit allows the processor to skip the page table walk of FIG. 1, resulting in higher performance because of the lower address translation latency. If the translation is not cached, the TLB miss requires a more expensive walk of the page tables. The x86 architecture dictates the semantics for keeping the TLB, which is a non-coherent cache of the page tables, synchronized with the page tables when the operating system makes modifications to them. A page table modification is only effective after the operating system invalidates the affected VAs using the INVLPG instruction or flushes the TLB of all stale VA-to-PA translations by modifying the control register (CR3). In the x86 architecture, writing to CR3 not only changes the base of the page table tree used for address translations but also flushes the TLB of stale translations to achieve synchronization to the page tables. Many processors indiscriminately flush the entire TLB on a write to CR3. However, processors with a tagged TLB associate each translation with a particular address space (a tag) and snoop the bus for memory writes to detect translations that become stale due to page table modifications, so they can retain translations across address space switches by removing only the stale translations.
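The non-coherent behavior described above may be illustrated with a minimal model. The class `Mmu` and its methods `invlpg` and `write_cr3` are hypothetical stand-ins for the hardware and for the INVLPG instruction and a CR3 write; the page table is reduced to a single-level dictionary for brevity, and the model flushes the entire TLB on a CR3 write, as an untagged TLB would.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class Mmu:
    """Toy processor MMU with an untagged, non-coherent TLB."""

    def __init__(self, page_table):
        self.page_table = page_table  # current address space: vpn -> ppn
        self.tlb = {}                 # cached translations: vpn -> ppn
        self.walks = 0                # number of page table walks performed

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn not in self.tlb:       # TLB miss: walk the page table
            self.walks += 1
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset

    def invlpg(self, va):
        """Invalidate the cached translation for the page containing va."""
        self.tlb.pop(va // PAGE_SIZE, None)

    def write_cr3(self, page_table):
        """Switch address spaces and flush every cached translation."""
        self.page_table = page_table
        self.tlb.clear()


page_table = {1: 0x10}
mmu = Mmu(page_table)

first = mmu.translate(0x1008)   # miss: walks the page table
page_table[1] = 0x20            # the operating system edits the page table
stale = mmu.translate(0x1008)   # hit: still returns the old mapping
mmu.invlpg(0x1008)
fresh = mmu.translate(0x1008)   # miss: the new mapping is now visible
mmu.write_cr3({1: 0x30})        # address space switch empties the TLB
switched = mmu.translate(0x1008)
```

The `stale` result shows why a page table modification takes effect only after the affected entry is invalidated: until then, the cached translation continues to be used.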
To isolate virtual machines, the virtualization software allocates separate portions of host memory to different virtual machines. However, this requires the physical addresses of the guest to be virtualized, so that while a guest may think a page of memory is at a certain physical address, that page actually resides at a different physical address in the host memory system. An address into what the guest thinks is physical memory is called a guest physical address (GPA), and an address into the host's physical memory is called a system physical address (SPA). Typically, GPAs do not correspond to SPAs. Therefore, the virtualization software must enforce an additional translation from GPA to SPA, while preserving the translation from VA to GPA enforced by the guest operating system. Consequently, the guest page tables cannot be directly traversed by the x86 processor's address translation hardware, because they map VAs to GPAs, not to SPAs. On every memory access, an unoptimized VMM must manually walk the guest page tables to translate the VA to a GPA (losing the benefit of the page table walking hardware), and then translate the GPA into an SPA using its internal tables.
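The two translations performed by an unoptimized VMM may be sketched as follows. The function names and the single-level dictionaries standing in for the guest page tables and for the VMM's internal GPA-to-SPA tables are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def guest_walk(guest_page_table, va):
    """Software walk of the guest page tables: VA -> GPA."""
    vpn, offset = divmod(va, PAGE_SIZE)
    return guest_page_table[vpn] * PAGE_SIZE + offset


def gpa_to_spa(vmm_table, gpa):
    """Look up the VMM's internal table: GPA -> SPA."""
    gpn, offset = divmod(gpa, PAGE_SIZE)
    return vmm_table[gpn] * PAGE_SIZE + offset


def translate(guest_page_table, vmm_table, va):
    """Both steps are performed in software on every memory access."""
    return gpa_to_spa(vmm_table, guest_walk(guest_page_table, va))


guest_page_table = {0x5: 0x2}  # maintained by the guest operating system
vmm_table = {0x2: 0x9A}        # maintained by the VMM, invisible to the guest

gpa = guest_walk(guest_page_table, 0x5010)
spa = translate(guest_page_table, vmm_table, 0x5010)
```

The guest believes its page resides at guest physical page 0x2, while the data actually resides at system physical page 0x9A. Only the VMM knows the second mapping.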
To improve the efficiency of memory virtualization, an optimized VMM may take the result of the VA-to-SPA translation it performs and cache it in a format that can be efficiently accessed in the future. One solution is for the VMM to maintain shadow page tables (SPTs) that map guest VAs to SPAs and are walked by the processor when resolving VAs. That way, when a VA cached in the SPTs is accessed, the page table walk hardware can use the SPTs to directly translate the VA into an SPA, thus allowing the guest to access memory without any intervention by the VMM. When the VA is not cached, the processor's address translation hardware generates a page fault, which the VMM receives and must service by walking the guest page tables and creating a VA-to-SPA translation in the SPTs. This VMM intervention is very expensive in terms of machine cycles. Whereas the hardware page table walk shown in FIG. 1 may take ten to a few hundred machine cycles, the VMM intervention may take several thousand machine cycles, which is up to two orders of magnitude more costly.
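The fill-on-fault behavior of shadow page tables may be sketched as shown below. The class `ShadowMmu`, its counter `vmm_fills`, and the flat dictionaries used for the guest page tables, the GPA-to-SPA map, and the SPTs are assumptions made for the example; the hardware walk and the page fault delivery are modeled by ordinary method calls.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class ShadowMmu:
    """Toy model of shadow page tables (SPTs) filled lazily by the VMM."""

    def __init__(self, guest_page_table, gpa_to_spa):
        self.guest_page_table = guest_page_table  # vpn -> guest physical page
        self.gpa_to_spa = gpa_to_spa              # guest page -> system page
        self.spt = {}                             # vpn -> system physical page
        self.vmm_fills = 0                        # costly VMM interventions

    def access(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn not in self.spt:       # hardware walk of the SPTs would fault
            self._service_fault(vpn)  # the VMM receives the page fault
        return self.spt[vpn] * PAGE_SIZE + offset

    def _service_fault(self, vpn):
        """Walk the guest tables in software and cache VA -> SPA in the SPTs."""
        self.vmm_fills += 1
        gpn = self.guest_page_table[vpn]
        self.spt[vpn] = self.gpa_to_spa[gpn]


shadow = ShadowMmu(guest_page_table={0x5: 0x2}, gpa_to_spa={0x2: 0x9A})
first = shadow.access(0x5010)   # not yet cached: the VMM fills the SPT entry
second = shadow.access(0x5FF0)  # same page: resolved from the SPTs alone
```

Only the first access to a page involves the VMM. Later accesses to that page are resolved from the SPTs, which is the case the page table walk hardware handles without intervention.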
VA-to-SPA translations cached in the SPTs are analogous to VA-to-PA translations cached in a TLB of a physical processor, because those VAs can be resolved without the VMM looking at the page tables. Therefore, the SPTs effectively form a virtual TLB with respect to the guest. This virtual TLB preferably has the same behavior as a physical TLB, so it preferably flushes all stale translations whenever the guest modifies the control register 110 in FIG. 1, which occurs whenever the operating system switches between two processes (hence, between two different address spaces, each with its own page table tree). On many processors, the hardware TLB flushes all translations indiscriminately, and the analogous operation for the virtual TLB would be to flush all entries in the SPTs. However, the cost of repopulating a VA-to-SPA entry in the virtual TLB is up to two orders of magnitude greater than the cost of repopulating a VA-to-PA entry in a physical TLB, and the virtual TLB is much larger than a physical TLB, so the impact of indiscriminately flushing all entries in the virtual TLB, instead of flushing only the stale entries, is significantly higher. Most operating systems frequently switch between address spaces and consequently flush the TLB frequently. The impact of flushing the TLB on a physical machine is small, but in a virtual machine environment, a large percentage of host processor cycles may be spent re-populating the virtual TLB as a result of TLB flushes by the guest operating system.
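The effect of indiscriminate flushing may be illustrated by counting refills in a toy virtual TLB that discards every shadow entry on each guest CR3 write. The class `FlushAllVirtualTlb`, the two guest address spaces, and the access pattern are hypothetical; the sketch counts VMM refills only and makes no claim about actual cycle costs.

```python
class FlushAllVirtualTlb:
    """Toy virtual TLB that empties the SPTs on every guest CR3 write."""

    def __init__(self):
        self.current = {}  # VA-to-SPA map of the active guest address space
        self.spt = {}      # cached shadow entries: vpn -> system physical page
        self.fills = 0     # VMM interventions needed to repopulate the SPTs

    def guest_write_cr3(self, va_to_spa):
        self.current = va_to_spa
        self.spt.clear()   # indiscriminate flush, stale or not

    def access(self, vpn):
        if vpn not in self.spt:
            self.fills += 1
            self.spt[vpn] = self.current[vpn]
        return self.spt[vpn]


# Two guest processes, each touching the same four virtual pages.
proc_a = {vpn: 0x100 + vpn for vpn in range(4)}
proc_b = {vpn: 0x200 + vpn for vpn in range(4)}

vtlb = FlushAllVirtualTlb()
for _ in range(3):                   # three rounds of A then B
    for space in (proc_a, proc_b):
        vtlb.guest_write_cr3(space)  # guest switches address space
        for vpn in range(4):
            vtlb.access(vpn)

distinct_translations = len(proc_a) + len(proc_b)
```

In this run the guest page tables never change, so only eight distinct translations exist, yet the VMM performs twenty-four fills because every switch discards entries that were still valid. A virtual TLB that retained non-stale entries across switches would avoid the repeated fills.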
Thus, there is a need for a method and system to implement a highly efficient translation from a virtual address to a system physical address in a virtual machine environment. It would be advantageous to implement a virtual TLB that retains as many cached address translations as possible across both address space switches and indiscriminate flushes of the entire TLB by the guest, in a manner that preserves the TLB semantics of existing processor architectures. Accordingly, the virtualization software may efficiently support existing operating systems for those architectures without requiring changes to those operating systems. In addition, such a virtual TLB is preferably not too expensive in terms of computer resources such as memory size or processor cycles. For this technique to be worth the added complexity, the virtual TLB is preferably more efficient than a simple virtual TLB that flushes all translations on an address space switch. The present invention addresses the aforementioned needs and solves them with additional advantages as expressed herein.