Modern computer systems provide support for virtual memory. With this support, an application sees a contiguous address space that is not tied to underlying physical memory in a computer system. In particular, an application accesses a virtual address (VA) that is translated to a physical address (PA) which is used to access the physical memory. The translation proceeds in one of two ways, depending on whether the system architecture supports “segmentation” or not. If the system architecture supports segmentation, then a VA is added to a value in a “segment base register,” to produce what is called a linear address (LA). The value of the VA is compared with a value in another register called a “segment limit register.” If the value of the VA is greater than the value in the segment limit register, then a “general protection fault” is generated, and execution jumps to a fault handler. Otherwise, the LA is translated to the PA using hardware called a Memory Management Unit (MMU). If the system architecture does not support segmentation, then the LA is the same as the VA, and the VA is used by the MMU to translate to the PA.
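The segmentation step described above can be sketched as follows. This is a minimal illustration only and does not model any particular processor; the names `segment_translate`, `GeneralProtectionFault`, `base`, and `limit` are hypothetical and merely stand in for the segment base register, the segment limit register, and the fault mechanism.

```python
class GeneralProtectionFault(Exception):
    """Illustrative stand-in for the fault raised on a segment limit violation."""


def segment_translate(va, base, limit):
    """Return the linear address (LA) for a virtual address (VA).

    `base` models the segment base register and `limit` models the
    segment limit register (both are assumed names for this sketch).
    """
    # The VA is compared with the limit before it is used.
    if va > limit:
        raise GeneralProtectionFault(f"VA {va:#x} exceeds limit {limit:#x}")
    # Otherwise the LA is the VA added to the segment base.
    return base + va
```

Without segmentation, the same sketch reduces to returning the VA unchanged, which corresponds to the case in which the LA is the same as the VA.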
FIG. 1B is a block diagram that helps illustrate how the MMU performs the translation process. The following uses a 64-bit architecture and a 4 KB page size for the sake of illustration. However, similar steps may be utilized for a 32-bit architecture, and different page sizes may be utilized as well. A more detailed discussion of translation processes can be found in “Intel 64 and IA-32 Architecture Application Note: TLBs, Paging-Structure Caches, and Their Invalidation” available from Intel of Santa Clara, Calif. Although this specification describes translation processes for Intel processors, the details are similar for other popular x86-based processors, and embodiments of the invention may be implemented on any x86-based or non-x86-based processor that supports virtual memory.
With respect to the x86 architecture, the MMU consists of the following parts: a control register CR3, a translation lookaside buffer (TLB), and translation circuitry. The paging structures: (a) are 4 KB in size; (b) reside in main memory, generally separate from the MMU; and (c) are designated L4, L3, L2 and L1 page tables (they are sometimes also called level 4, 3, 2 and 1 page tables), respectively. In some implementations, each page table contains 512 8-byte entries comprising information required to perform the translation. The following information is relevant here. First is the CR3 register, which contains the physical address of an L4 page table. Second is the physical page number of the next lower level page table, which is held in a page table entry. For example, referring to FIG. 1B, a page table entry in the L4 page table contains the physical page number of the L3 page table. Third is a present bit (P-bit). The P-bit indicates whether a page table entry is “present” or “not present.” If the P-bit is 0 in any page table entry used during translation, translation stops and a page fault is generated. Fourth is an accessed bit (A-bit). When a linear address is used to access memory, the processor sets the A-bit to 1 in the page table entry used at each page table level to translate the linear address.
As is well known, translation of a linear address LA (generally referred to as “walking the page table” or a “page table walk”) proceeds as follows. A linear page number (LPN) is formed from the most significant 52 bits of the LA. The least significant 12 bits of the LA constitute what is called an “offset.” Using the CR3 register, the physical address of the L4 page table is determined. An entry in the L4 page table is selected by bits 47:39 of the LA; the selected entry contains a physical address of the L3 page table. An entry in the L3 page table is selected by bits 38:30 of the LA; the selected entry contains a physical address of the L2 page table. An entry in the L2 page table is selected by bits 29:21; the selected entry contains a physical address of the L1 page table. An entry in the L1 page table is selected by bits 20:12; the selected entry contains a PPN corresponding to the LPN. Finally, the PA comprises the PPN as its 52 most significant bits, and the offset as its 12 least significant bits.
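The page table walk described above can be sketched as follows. This is a simplified model under stated assumptions, not a description of actual MMU hardware: main memory is modeled as a hypothetical dictionary `tables` that maps the physical address of each page table to a list of 512 entries, and each entry is assumed to carry the P-bit in bit 0, the A-bit in bit 5, and the physical address of the next level in bits 51:12. The names `walk`, `tables`, and `PageFault` are illustrative.

```python
PAGE_SHIFT = 12                      # 4 KB pages: 12-bit offset
INDEX_MASK = (1 << 9) - 1            # 9 index bits select one of 512 entries
P_BIT = 1 << 0                       # present bit (assumed position)
A_BIT = 1 << 5                       # accessed bit (assumed position)
ADDR_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 51:12


class PageFault(Exception):
    """Illustrative stand-in for the page fault raised when a P-bit is 0."""


def walk(tables, cr3, la):
    """Translate linear address `la` to a physical address.

    `tables` maps the physical address of a page table to its list of
    512 integer entries; `cr3` holds the physical address of the L4 table.
    """
    table_pa = cr3 & ADDR_MASK
    # Bits 47:39, 38:30, 29:21 and 20:12 index the L4, L3, L2 and L1 tables.
    for shift in (39, 30, 21, 12):
        index = (la >> shift) & INDEX_MASK
        table = tables[table_pa]
        entry = table[index]
        if not entry & P_BIT:
            raise PageFault(hex(la))          # translation stops here
        table[index] = entry | A_BIT          # mark the entry as accessed
        table_pa = entry & ADDR_MASK          # next table, or the final page
    # After the L1 step, table_pa is the physical address of the page itself.
    return table_pa | (la & ((1 << PAGE_SHIFT) - 1))
```

In this sketch the final step forms the PA by combining the page address taken from the L1 entry with the 12-bit offset of the LA.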
As is well known, an MMU is typically equipped with one or more TLBs, where a TLB is a cache of recent LPN to PPN translations. To translate an LA, the MMU computes the LPN, and then looks in the TLB for a translation of the LPN. If the translation is present in the TLB (referred to as a “TLB hit”), the PPN is immediately available from the TLB. On the other hand, if the translation is not present in the TLB (referred to as a “TLB miss”), a page table walk is done, and the translation is stored in the TLB, potentially evicting another entry from the TLB.
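The TLB behavior described above can be sketched as follows. The sketch assumes a least-recently-used eviction policy and a fixed capacity, neither of which is specified above; the class name `TLB` and the parameter `walk_fn` (a callable that stands in for the page table walk and maps an LPN to a PPN) are hypothetical.

```python
from collections import OrderedDict


class TLB:
    """A small cache of recent LPN-to-PPN translations (illustrative only)."""

    def __init__(self, capacity, walk_fn):
        self.capacity = capacity
        self.walk_fn = walk_fn          # stands in for the page table walk
        self.entries = OrderedDict()    # LPN -> PPN, oldest entry first
        self.hits = 0
        self.misses = 0

    def translate(self, la):
        lpn, offset = la >> 12, la & 0xFFF
        if lpn in self.entries:
            # TLB hit: the PPN is immediately available.
            self.hits += 1
            self.entries.move_to_end(lpn)
        else:
            # TLB miss: walk the page tables and cache the translation,
            # evicting the least recently used entry if the TLB is full.
            self.misses += 1
            self.entries[lpn] = self.walk_fn(lpn)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)
        return (self.entries[lpn] << 12) | offset
```

Real TLBs are typically set-associative hardware structures with their own replacement policies; the ordered dictionary is used here only to keep the hit and miss paths easy to follow.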
As is well known, virtualizing an MMU so that multiple virtual machines can run on a single hardware system typically entails another level of translation. The first translation is provided by a guest operating system (guest OS) running in a virtual machine. The guest OS translates a guest LPN (GLPN) into a corresponding guest PPN (GPPN) in the conventional manner. The second translation is provided by virtualization software, for example, a virtual machine monitor (VMM). In particular, the VMM maintains a GPPN to “host PPN” mapping in its internal translation table where the host PPN is used to address physical memory of the hardware system. In the rest of this specification, this translation table will be referred to as T.
One of two methods is typically used for virtualizing an MMU, namely a shadowing of guest paging structures method (shadowing method), and a hardware assist method. FIG. 1C illustrates the shadowing method. As shown in FIG. 1C, in the shadowing method for virtualizing an MMU, virtualization software, for example, the VMM, maintains shadow page tables, one shadow page table for each guest page table. While the guest page tables, maintained by the guest operating system, contain guest LPN to guest PPN mappings, the shadow page tables contain guest LPN to host PPN mappings. To insert a translation for a guest LPN gl into a shadow page table, the VMM walks the guest page table to determine the guest PPN gp. Then, it translates the guest PPN gp to a host PPN hp using its translation table T.
In operation, for a particular guest, the VMM changes the CR3 register so that it points to the shadow page table for that guest. By doing this, the hardware MMU uses the shadow page table for address translation. Further, the VMM saves the value of the CR3 register for that guest. To ensure correct guest operation, the VMM traps guest accesses to the CR3 register. Thus, whenever the guest reads the CR3 register, the VMM provides it with the saved value of the CR3 register. Further, whenever the guest writes to the CR3 register, the VMM saves the new value of the CR3 register for the guest, and writes to the physical CR3 register the physical address of the new shadow page table.
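The trapping of guest accesses to the CR3 register can be sketched as follows. This is a minimal model under assumptions: the class `VirtualCR3`, its attribute names, and the `allocate_shadow_root` callback are hypothetical, and the reuse of a previously built shadow page table for a guest CR3 value seen before is a design choice of this sketch rather than something stated above.

```python
class VirtualCR3:
    """Models how a VMM might virtualize CR3 for one guest (illustrative)."""

    def __init__(self):
        self.guest_cr3 = 0        # value the guest believes is in CR3
        self.physical_cr3 = 0     # value actually loaded into the hardware
        self.shadow_roots = {}    # guest CR3 value -> shadow table address

    def on_guest_read(self):
        # A trapped read returns the saved guest value, not the hardware value.
        return self.guest_cr3

    def on_guest_write(self, value, allocate_shadow_root):
        # A trapped write saves the guest value ...
        self.guest_cr3 = value
        # ... finds or creates the matching shadow page table (assumed policy) ...
        if value not in self.shadow_roots:
            self.shadow_roots[value] = allocate_shadow_root()
        # ... and points the hardware at the shadow table instead.
        self.physical_cr3 = self.shadow_roots[value]
```

The guest thus never observes the physical address of a shadow page table, while the hardware MMU never uses the guest page table directly.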
When a new shadow page table is loaded by the VMM (by writing a new value to the CR3 register in response to a guest writing to the CR3 register), it is not necessary that the shadow page table contains valid entries corresponding to all valid entries in the guest page table. For a guest page table entry, if the PPN in its corresponding shadow page table entry is invalid, the VMM sets the P-bit of that shadow page table entry to 0. Then, when this shadow page table entry is used to translate an address, a page fault will be generated. Such a page fault is called a “hidden page fault” because it would not have been generated if the guest operating system were executing in a non-virtualized environment. In response, a page fault handler of the VMM determines the faulting guest linear page number gl (using a faulting virtual address put into a predetermined hardware register), and translates it to a guest physical page number gp by walking the guest page tables. Then, the guest physical page number gp is translated to a host physical page number hp using translation table T. Finally, the host physical page number hp is inserted into the concerned shadow page table entry, the P-bit of the entry is set to 1, and the faulting guest instruction is restarted.
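The handling of a hidden page fault can be sketched as follows. For brevity, the sketch flattens the multi-level page tables into dictionaries: `guest_pt` maps a guest LPN to a guest PPN, `T` maps a guest PPN to a host PPN, and `shadow` maps a guest LPN to an entry holding a PPN and a P-bit. These structures and the names `handle_hidden_fault` and `GuestPageFault` are hypothetical. The case in which the guest page table itself has no mapping is an assumption added for completeness; in that case the fault is treated as a true fault for the guest rather than a hidden one.

```python
class GuestPageFault(Exception):
    """Illustrative: the fault is not hidden and belongs to the guest."""


def handle_hidden_fault(gl, guest_pt, T, shadow):
    """Fill the shadow entry for faulting guest linear page number `gl`."""
    # Walk the (flattened) guest page table: gl -> gp.
    if gl not in guest_pt:
        raise GuestPageFault(gl)      # assumed handling of a true guest fault
    gp = guest_pt[gl]
    # Translate guest physical to host physical using table T: gp -> hp.
    hp = T[gp]
    # Insert hp into the shadow entry and set its P-bit to 1.
    shadow[gl] = {"ppn": hp, "present": True}
    # The caller would now restart the faulting guest instruction.
    return hp
```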
The VMM needs to ensure that shadow page tables are synchronized with guest page tables. For example, if the guest OS changes a translation of guest linear address gl from gp to gp′, the VMM needs to update the corresponding entry in the shadow page table as well.
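One way to keep a shadow page table synchronized can be sketched as follows. The sketch assumes that the VMM has some means of observing the guest's update, which is not modeled here, and it uses flattened dictionaries in which the absence of a shadow entry stands for a P-bit of 0. The function name `on_guest_mapping_change` and the `eager` flag are hypothetical.

```python
def on_guest_mapping_change(gl, new_gp, guest_pt, T, shadow, eager=True):
    """Reflect a guest change of the mapping for `gl` from gp to gp'.

    guest_pt: guest LPN -> guest PPN
    T:        guest PPN -> host PPN
    shadow:   guest LPN -> host PPN (a missing key models a P-bit of 0)
    """
    guest_pt[gl] = new_gp
    if eager and new_gp in T:
        # Update the shadow entry immediately so the next access succeeds.
        shadow[gl] = T[new_gp]
    else:
        # Invalidate instead; the next access raises a hidden page fault
        # and the entry is refilled at that time.
        shadow.pop(gl, None)
```

Either branch leaves the shadow page table consistent with the guest page table: the entry either holds the new host PPN or is not present.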
The hardware assist method for virtualizing an MMU makes use of a virtualization assist provided in certain modern CPUs. As is well known, apart from conventional page tables provided by earlier CPUs, certain modern CPUs provide a second layer of page tables. The design of the second layer of page tables, called “hypervisor page tables,” is similar to that of the guest page tables as described above. The guest page tables, maintained by the guest operating system, continue to contain guest linear to guest physical translations. However, the hypervisor puts the guest physical to host physical translations in the second layer of page tables (a description of a hypervisor is provided in the Appendix below). Apart from the CR3 register provided by earlier CPUs, the virtualization assist of the modern CPUs provides an hCR3 register which contains the host physical address of the L4 page table of the hypervisor page tables.
With the hardware assist, on a TLB miss, the MMU walks through both sets of page tables to translate a guest linear address to a host physical address. This is done by using translations from the guest page tables and the hypervisor page tables: the guest linear address is translated to a guest physical address using the guest page tables, and the guest physical address is translated to a host physical address using the hypervisor page tables. Using the hardware assist obviates the construction and maintenance of shadow page tables.
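The two-stage translation performed with the hardware assist can be sketched as follows. The sketch is deliberately simplified: both sets of page tables are flattened into hypothetical dictionaries (`guest_pt` mapping a guest LPN to a guest PPN, and `hypervisor_pt` mapping a guest PPN to a host PPN), and the sketch does not model the multi-level walk of either set. The function name `nested_translate` is illustrative.

```python
def nested_translate(gla, guest_pt, hypervisor_pt):
    """Translate a guest linear address to a host physical address."""
    glpn, offset = gla >> 12, gla & 0xFFF
    # Stage 1: guest linear -> guest physical, using the guest page tables.
    gppn = guest_pt[glpn]
    # Stage 2: guest physical -> host physical, using the hypervisor page tables.
    hppn = hypervisor_pt[gppn]
    # The page offset passes through both stages unchanged.
    return (hppn << 12) | offset
```

Because the hardware combines the two mappings on each TLB miss, no software-maintained structure holding the combined guest linear to host physical mapping is needed in this sketch.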
FIG. 1D illustrates an example of virtual machine memory virtualization. As shown in FIG. 1D, each of the virtual machines VM1 and VM2 runs two processes: process 1 and process 2. The processes execute using virtual memory 122, which is mapped to physical memory 124, and physical memory 124 is mapped, in turn, to machine memory 126. The guest OS in each virtual machine maps virtual memory (i.e., virtual page numbers (VPN)) to guest physical memory (i.e., guest physical page number (GPPN))—this is shown by inset 112. In addition, virtualization software, for example, a VMM, maintains a mapping of guest physical page numbers (GPPN) to actual machine or host page numbers (MPN)—this is shown by inset 114. The VMM also maintains a mapping of VPN to MPN in shadow page tables to map virtual memory directly to machine memory (i.e., VPN to MPN) to avoid two levels of translation on every access—this is shown by inset 116. When the guest OS changes the virtual memory to physical memory mapping, the VMM updates (or invalidates) the shadow page tables to enable a consistent lookup (or generation of a hidden page fault).
Various advancements in microprocessor fabrication have led to modern CPUs that are significantly faster than main memory. Consequently, CPUs running demanding applications may become constrained by memory bandwidth. This issue may be exacerbated in symmetric multiprocessing (SMP) systems, wherein applications running on several CPUs access main memory together, thereby competing for access to a shared memory bus. To overcome this issue, a Non-Uniform Memory Access (NUMA) architecture has been developed.
In NUMA machines, processors are divided into NUMA nodes where each node has a separate memory, and a bus connects the processors of a node to the memory. An example is illustrated in FIG. 1E where each of NUMA node 101 and NUMA node 102 has several processors coupled to local memory. As further shown in FIG. 1E, the different NUMA nodes are connected by an interconnect 103. In NUMA machines, each processor can access any memory, including memory on different NUMA nodes; however, access to memory on the same NUMA node (referred to as “local memory”) is faster than access to memory on a different NUMA node (referred to as “remote memory”). Due to this, software designers attempt to ensure that the data most often accessed by a program running on a CPU (called a “working set” of the program) usually resides on the same NUMA node as the CPU.
From time to time, it may become necessary or desirable to migrate a virtual machine from one NUMA node to another. Consider, for example, a situation illustrated in FIG. 1E where NUMA node 101 includes processor A having two virtual machines, VMA1 and VMA2, running demanding applications, thereby heavily loading processor A. On the other hand, assume that one or both of processors A′ and B′ on NUMA node 102 are lightly loaded. To alleviate the load on processor A, it may be desirable or necessary to migrate one of VMA1 and VMA2 to NUMA node 102. Such load balancing may be done automatically by continuously monitoring the load on all of the nodes.
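One possible way to automate such load balancing can be sketched as follows. The policy shown is only a hypothetical heuristic and is not taken from the description above: it compares the total load of the most heavily and most lightly loaded nodes and, if the difference exceeds a caller-supplied threshold, selects the virtual machine whose load is closest to half of that difference. The function name `pick_migration`, the layout of `node_vm_loads`, and the `threshold` parameter are all assumptions.

```python
def pick_migration(node_vm_loads, threshold):
    """Suggest one VM migration, or return None if the nodes are balanced.

    node_vm_loads: {node name: {VM name: load}}, with load in arbitrary units.
    Returns a tuple (vm, source node, destination node).
    """
    totals = {node: sum(vms.values()) for node, vms in node_vm_loads.items()}
    src = max(totals, key=totals.get)     # most heavily loaded node
    dst = min(totals, key=totals.get)     # most lightly loaded node
    gap = totals[src] - totals[dst]
    if gap <= threshold or not node_vm_loads[src]:
        return None
    # Moving a VM whose load is near gap / 2 brings the two nodes closest together.
    vm = min(node_vm_loads[src],
             key=lambda name: abs(node_vm_loads[src][name] - gap / 2))
    return vm, src, dst
```

A monitoring loop could call such a function periodically with fresh load measurements; how the load is measured and how the migration itself is carried out are outside the scope of this sketch.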