1. Field of the Invention
This invention relates generally to memory management, and more particularly to systems and methods for using writeable page tables to increase performance of memory address translation in computing environments utilizing a hypervisor.
2. Background
Operating systems, or OS's, are typically designed to run multiple user processes, and more particularly multiple applications, each running in its own virtual address space in virtual memory. Through allocation of virtual memory to each such application, applications are provided a virtual address space in which they can operate, which address space is separate from all other applications. Such virtual address space may start at the 0 address and extend to, for example, 3 GB (or as far as the system architecture will allow). The use of virtual memory allows physical memory (in which the user code and data actually reside) to be split into chunks, and enables the OS to manage those chunks as they are needed for various applications. However, those chunks of physical memory that are used for a single application may not be arranged in a single contiguous region. Virtual memory provides a mechanism by which the chunks of physical memory may be managed with some structure to enable, from the user's perspective, generally contiguous and smooth operation of an application despite mapping portions of each such application into and out of relevant blocks of physical memory as required. In order to properly interrelate the virtual memory locations with physical memory locations, the virtual memory system will translate virtual addresses to physical addresses, storing mappings from one to the other in a page table.
Both the virtual and physical memory are divided into chunks. On an x86 architecture, the virtual address space and physical memory address space are both commonly split into pages 4 kilobytes in size (for physical memory, such chunks are sometimes referred to as “page frames” or simply “frames” instead of pages). Memory that is allocated to each application is allocated in pages within the physical memory of the system. Notably, those pages in physical memory may not be contiguous, because as pages get used, new applications get started and/or terminated, and memory becomes fragmented. For example, an instruction for a user process might start at the virtual address 0x00000004, indicating page number 0x0, and offset 0x004. In reality, this may correspond to the physical address 0xff0e0004, indicating frame number 0xff0e0, and offset 0x004. Another instruction for that same application may be in an entirely distinct physical address region in physical memory. Thus, it is necessary to translate from a given virtual address that the application sees to a physical address. The virtual memory system converts the virtual addresses into physical addresses, essentially mapping between pages and frames, using the page table.
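The page/offset arithmetic described above can be illustrated with a short sketch. It assumes 4-kilobyte pages, so the low 12 bits of an address are the offset and the remaining upper bits are the page (or frame) number. The function names `split_address` and `join_address` are hypothetical and chosen only for this illustration.

```python
PAGE_SHIFT = 12                  # 4 KB pages: 2**12 bytes per page
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1      # selects the low 12 bits


def split_address(addr):
    """Return (page_or_frame_number, offset) for a 32-bit address."""
    return addr >> PAGE_SHIFT, addr & OFFSET_MASK


def join_address(number, offset):
    """Rebuild an address from a page/frame number and an offset."""
    return (number << PAGE_SHIFT) | offset


# The example addresses from the text above.
virtual_page, virtual_offset = split_address(0x00000004)
physical_frame, physical_offset = split_address(0xFF0E0004)
```

Note that the offset is identical on both sides of the translation; only the page number is replaced by a frame number.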
In the case of the Intel x86 processor, such memory management is implemented using hardware page tables. For each application that is running, the OS maintains a page table, the format of which is determined by the x86 specification. When the user switches from one application to another, the CPU switches to the page table associated with the selected application. In the case of the basic 32-bit x86 architecture, the format of the page table is a 2-level page table. The page table translates 32-bit virtual addresses into 32-bit physical addresses. Each such 32-bit address comprises 12 least significant bits that are the offset within the page and an additional 20 bits. When translating from a virtual page number to a physical page number, the 12 offset bits proceed through such translation unmodified. The remaining 20 bits, however, are divided into two groups of 10 bits each and, as discussed in detail below, are used to ultimately identify the address in physical memory to which the virtual address is to be translated.
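As a minimal sketch of the bit layout just described, the following splits a 32-bit virtual address into its two 10-bit groups and its 12-bit offset. The function name `split_virtual_address` and the field names are hypothetical labels for this illustration.

```python
def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (upper 10 bits, next 10 bits, offset)."""
    upper_index = (vaddr >> 22) & 0x3FF   # most significant 10 bits
    lower_index = (vaddr >> 12) & 0x3FF   # next most significant 10 bits
    offset = vaddr & 0xFFF                # least significant 12 bits, never translated
    return upper_index, lower_index, offset


fields = split_virtual_address(0x00403004)
```

Each 10-bit group can take 1024 distinct values, which matches a table of 1024 entries at each level.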
As a preliminary matter, however, we note that for every process the OS maintains a page table and the hardware is configured such that whenever the user starts running a new process, the hardware first points to the root of the page table for that process. The root of the page table is itself a page referred to as the page directory (“PGD”). The PGD is a page that has been allocated in memory that uses a number of further pages to represent a hierarchical page table structure that defines the translation. The PGD is referred to herein as an “L2,” because it is the root of a 2-level page table, i.e., the uppermost level (level 2). Likewise, all of the leaves of the page table are referred to herein as “L1.” The L2 PGD is thus a single page of 4 KB, and we consider the PGD as having 1024 entries, each of which is 4 bytes in size. When performing a translation, i.e., when the hardware wants to convert a virtual address into a physical address, the first thing that the hardware does is to look at the most significant 10 bits of the virtual address, and use those 10 bits as an index into the 1024 entries in the L2 PGD. The hardware will thus extract one of those 4-byte entries. Within the extracted 4-byte entry there is another 20 bits of address information along with 12 bits containing status information. The 20 bits indicate the physical machine frame number of the L1 page table page that must next be consulted to perform the translation. The L1 page table page is very similar to the PGD, containing 1024 entries, each of 4 bytes. The next most significant 10 bits of the virtual address are then used to index into the identified L1 page table. The resulting entry in that L1 page table comprises 4 bytes that again contain 20 bits and 12 bits as above. The 20 bits produced from this entry indicate the physical frame number to which the virtual page number needs to be converted. That frame will serve as the memory location that the hardware actually reads from or writes to when the application refers to a particular virtual address. The remaining 12 bits that index within the page are combined with that new physical page frame number to produce the actual physical address that is to be accessed.
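The two-level lookup described above can be sketched in software as follows. This is an illustrative model only, not the hardware implementation: physical memory is represented as a hypothetical dictionary from frame number to a list of 1024 entries, and bit 0 of the 12 status bits is assumed to act as a "present" flag. The names `make_entry`, `translate`, and `memory` are likewise assumptions of this sketch.

```python
ENTRIES_PER_TABLE = 1024
PRESENT = 0x001   # assumption: bit 0 of the status bits marks a valid entry


def make_entry(frame, flags=PRESENT):
    """Pack a 20-bit frame number and 12 status bits into one 4-byte entry."""
    return (frame << 12) | flags


def translate(memory, pgd_frame, vaddr):
    """Walk the L2 (PGD) and then the L1 table to translate vaddr."""
    l2_index = (vaddr >> 22) & 0x3FF
    l1_index = (vaddr >> 12) & 0x3FF
    offset = vaddr & 0xFFF

    l2_entry = memory[pgd_frame][l2_index]
    if not l2_entry & PRESENT:
        raise LookupError("no L1 page table for this address")
    l1_frame = l2_entry >> 12            # frame holding the L1 page table

    l1_entry = memory[l1_frame][l1_index]
    if not l1_entry & PRESENT:
        raise LookupError("page not mapped")
    frame = l1_entry >> 12               # frame holding the data itself

    return (frame << 12) | offset        # offset passes through unmodified


# Toy "physical memory": one PGD at frame 0x00100, one L1 at frame 0x00200.
memory = {
    0x00100: [0] * ENTRIES_PER_TABLE,
    0x00200: [0] * ENTRIES_PER_TABLE,
}
memory[0x00100][0] = make_entry(0x00200)   # PGD entry 0 points at the L1
memory[0x00200][0] = make_entry(0xFF0E0)   # L1 entry 0 points at the data frame

physical = translate(memory, 0x00100, 0x00000004)
```

In this toy setup, virtual address 0x00000004 resolves to physical address 0xff0e0004, while any address whose upper 10 bits select an empty PGD entry raises an error.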
The issue of translating virtual addresses to physical addresses becomes more complicated in a hypervisor environment. In a virtual machine (“VM”) environment, a single physical machine with one or more physical processors combines with software to simulate multiple virtual machines. Each of those virtual machines requires access to some of the physical resources of the computer on which it resides. A hypervisor controls the assignment of resources to each virtual machine. Thus, the hypervisor, as opposed to the OS's on the virtual machines, typically controls the allocation of physical hardware resources. The hypervisor typically intercepts requests for resources from the OS's on the virtual machines and manages those requests to avoid conflict among the separate OS's. As the OS is traditionally the most privileged entity running on the computer, and thus in the typical case has the ability to manipulate virtual and physical memory, in a hypervisor environment it is necessary to restrict such capabilities. The way that most hypervisors provide for such control is through the use of shadow page tables. In this case, the guest OS's that are running on the computer will, as discussed above, maintain their own page table structures for each of their own processes. The page tables that the hardware is aware of are entirely separate and maintained by the hypervisor. It has been found, however, that it is advantageous to allow the guest OS to directly use the hardware page tables, thus avoiding the memory cost of an additional copy and the computational cost of keeping the two copies synchronized. To do so requires careful control to ensure that manipulations by one OS do not conflict with any other OS's (e.g., one OS must be prevented from accessing another OS's memory).
A system and method allowing the hardware page tables to be the same page tables that the guest OS's are manipulating is described in “Xen and the Art of Virtualization,” published in the Proceedings of the Association for Computing Machinery (ACM) 19th Symposium on Operating Systems Principles (SOSP19), October 2003, and incorporated herein by reference. As described therein, each guest OS is capable of reading all of its own page tables, having a page table for every process of which that OS is aware. However, while the guest OS's are fully capable of reading their own page tables, writing to those page tables could be problematic. If a guest OS could write directly to a page table, it could write any value into one of the L1 or L2 page tables, and in particular could place into such a page table a 20-bit value that refers to a page frame that it does not own. That is, it could refer to one of possibly a million pages in the system that is not allocated to that OS. It is thus necessary to check that any value that is being placed into the page directory or into the page tables (i.e., the L2 and L1, respectively) refers to pages that the subject OS owns. Otherwise, such an entry could create a mapping to a page owned by another OS which, when used by hardware, could read or even corrupt another OS's data. Further, it is necessary to check that the value being written into the page table does not create a writeable mapping to a page that is being used as an L1 or L2 page table page, since otherwise this mapping could be used to update the page table directly and hence circumvent the hypervisor.
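The two checks described above (ownership of the target frame, and no writeable mapping onto a page table page) might be expressed as a single predicate. The following is a sketch under stated assumptions rather than the validation logic of any particular hypervisor: the bookkeeping structures `frame_owner` and `page_table_frames`, the function name `is_safe_entry`, the use of bits 0 and 1 of the status bits as "present" and "writeable" flags, and the treatment of non-present entries as always acceptable are all choices made for this illustration.

```python
PRESENT = 0x001    # assumption: bit 0 of the status bits marks a valid mapping
WRITABLE = 0x002   # assumption: bit 1 of the status bits permits writes


def is_safe_entry(entry, guest_id, frame_owner, page_table_frames):
    """Decide whether a guest may install `entry` in one of its page tables.

    frame_owner:       hypothetical dict mapping frame number -> owning guest
    page_table_frames: hypothetical set of frames currently used as L1 or L2
    """
    if not entry & PRESENT:
        return True                      # maps nothing, so nothing to check
    frame = entry >> 12                  # the 20-bit frame number
    if frame_owner.get(frame) != guest_id:
        return False                     # would map a frame the guest does not own
    if (entry & WRITABLE) and frame in page_table_frames:
        return False                     # would allow bypassing the hypervisor
    return True


frame_owner = {0x100: "guest_a", 0x101: "guest_a", 0x200: "guest_b"}
page_table_frames = {0x101}

own_data_page = (0x100 << 12) | PRESENT | WRITABLE
foreign_page = (0x200 << 12) | PRESENT
writable_page_table = (0x101 << 12) | PRESENT | WRITABLE
readonly_page_table = (0x101 << 12) | PRESENT
```

A read-only mapping of the guest's own page table page passes the check, which is consistent with the guest being able to read, but not directly write, its page tables.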
Thus, whenever a guest OS attempts to write to its page table, the hypervisor is invoked through a fault handler which in turn evaluates the attempt by the guest OS to write a particular value into one of the L1 or L2 pages to determine whether the attempted write corresponds to a page frame that the OS owns. If the OS does own that page frame, the write is allowed to proceed, by emulating the access in the hypervisor. Such updates are typically performed in batches. Thus, in a simplistic implementation of the foregoing, every time a write is performed to update a page table, the hypervisor would be invoked to check whether the update is valid. If the update is deemed valid, the hypervisor would perform the access, and then return control to the OS at the instruction following the page table write.
Thus, using the above described method, whenever a page table page is written to by a guest OS, the system faults into the hypervisor, performs the update, and returns. However, typical OS operations, such as exiting or otherwise terminating an application process, result in many such consecutive updates in order to destroy the process's page tables. The OS thus proceeds through a loop, terminating all of the L1's by zeroing them and zeroing the L2, such that the process page table is effectively “dead.” This requires a large number of page table updates to be performed in a batch. Similarly, when creating a new process, OS's will typically take the currently running application and create a copy of its page table. Once again, a loop is performed which reads from the current page table and builds a new page table. Thus, the OS will allocate a page to be the L2, the entries will be copied across and a set of pages will be allocated to be the L1's (up to 1024 for a given process). These large sets of updates can lead to a large increase in process creation and process termination time, particularly if a simple emulation scheme is employed which requires entering and leaving the hypervisor frequently.
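The cost of the simple emulation scheme can be made concrete with a small counting model. The sketch below assumes that every page table write causes exactly one entry into the hypervisor and that the OS zeroes every entry of every table during teardown; a real OS might skip entries that are already empty. The class `CountingHypervisor` and the function `destroy_page_tables` are hypothetical names, and validation of the written value is omitted for brevity.

```python
ENTRIES_PER_TABLE = 1024


class CountingHypervisor:
    """Toy model that counts how often the guest faults into the hypervisor."""

    def __init__(self):
        self.entries = 0

    def emulate_write(self, table, index, value):
        self.entries += 1        # one fault and return per page table write
        table[index] = value     # validation omitted in this sketch


def destroy_page_tables(hypervisor, l2, l1_tables):
    """Zero every L1 table and then the L2, one emulated write at a time."""
    for l1 in l1_tables:
        for index in range(ENTRIES_PER_TABLE):
            hypervisor.emulate_write(l1, index, 0)
    for index in range(ENTRIES_PER_TABLE):
        hypervisor.emulate_write(l2, index, 0)


hypervisor = CountingHypervisor()
l2 = [1] * ENTRIES_PER_TABLE
l1_tables = [[1] * ENTRIES_PER_TABLE for _ in range(4)]
destroy_page_tables(hypervisor, l2, l1_tables)
```

Under these assumptions, even a small process with four L1 pages incurs 4 x 1024 + 1024 = 5120 hypervisor entries on teardown, and a process using all 1024 L1 pages would incur over a million.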
Thus, there remains a need in the art for a system and method capable of efficiently carrying out memory address translation in a hypervisor environment.