1. Field of the Invention
This invention relates to a first software entity transparently using an address space of a second software entity, while preventing the second software entity from accessing memory of the first software entity.
2. Description of the Related Art
In this patent, a particular embodiment of the invention is described in terms of a virtual computer system in which virtualization software runs on a physical computer system and supports a virtual computer, or virtual machine. Guest software, such as a guest operating system (OS) and guest applications, may be loaded onto the virtual computer for execution. The virtualization software occupies a portion of a linear address space of the guest software. This embodiment of the invention relates to protecting the virtualization software from the guest software. In particular, this embodiment of the invention may be implemented as an improvement to existing virtualization products of the assignee of this patent, VMware, Inc. of Palo Alto, Calif. Consequently, this description begins with an introduction to virtual computing and the virtualization products of VMware.
Virtualization has brought many advantages to the world of computers. As is well known in the art, a virtual machine (VM) is a software abstraction—a “virtualization”—of an actual physical computer system that runs as a “guest” on an underlying “host” hardware platform. As long as a suitable interface is provided between the VM and the host platform, one advantage is that the operating system (OS) in the guest need not be the same as the OS at the system level in the host. For example, applications that presuppose a Microsoft Windows OS can be run in the VM even though the OS used to handle actual I/O (input/output), memory management, etc., on the host might be Linux.
A typical application usually requires less than 10% of the processing capacity of a CPU, although usage may peak briefly for certain operations. Virtualization can use processing capacity more efficiently by allowing more than one VM to run on a single host, effectively multiplying the number of “computers” per “box.” Depending on the implementation, the reduction in performance is negligible, or at least not enough to justify separate, dedicated hardware “boxes” for each user or application.
Still another advantage is that different VMs can be isolated from and completely transparent to one another. Indeed, the user of a single VM will normally be unaware that he is not using a “real” computer, that is, a system with hardware dedicated exclusively to his use. The existence of the underlying host will also be transparent to the guest software itself. The products of VMware provide all of these advantages in that they allow multiple, isolated VMs, which may (but need not) have OSs different from each other's, to run on a common hardware platform.
Example of a Virtualized System
FIG. 1 illustrates the main components of a system that supports a virtual machine as generally implemented in the Workstation product of VMware, Inc. As in conventional computer systems, both system hardware 100 and system software 200 are included. The system hardware 100 includes CPU(s) 102, which may be a single processor, or two or more cooperating processors in a known multiprocessor arrangement. The system hardware also includes system memory 104, one or more disks 106, and some form of memory management unit (MMU) 108. As is well understood in the field of computer engineering, the system hardware also includes, or is connected to, conventional registers, interrupt-handling circuitry, a clock, etc., which, for the sake of simplicity, are not shown in the figure.
The system software 200 either is or at least includes an operating system (OS) 220, which has drivers 240 as needed for controlling and communicating with various devices 110, and usually with the disk 106 as well. Conventional applications 260, if included, may be installed to run on the hardware 100 via the system software 200 and any drivers needed to enable communication with devices.
As mentioned above, the virtual machine (VM) 300—also known as a “virtual computer”—is a software implementation of a complete computer system. In the VM, the physical system components of a “real” computer are emulated in software, that is, they are virtualized. Thus, the VM 300 will typically include virtualized (“guest”) system hardware 301, which in turn includes one or more virtual CPUs 302 (VCPU), virtual system memory 304 (VMEM), one or more virtual disks 306 (VDISK), and one or more virtual devices 310 (VDEVICE), all of which are implemented in software to emulate the corresponding components of an actual computer. The concept, design and operation of virtual machines are well known in the field of computer science.
The VM's system software 312 may include a guest operating system 320, which may, but need not, simply be a copy of a conventional, commodity OS, as well as drivers 340 (DRVS) as needed, for example, to control the virtual device(s) 310. Of course, most computers are intended to run various applications, and a VM is usually no exception. Consequently, by way of example, FIG. 1 illustrates one or more applications 360 installed to run on the guest OS 320; any number of applications, including none at all, may be loaded for running on the guest OS, limited only by the requirements of the VM. Software running in the VM 300, including the guest OS 320 and the guest applications 360, is generally referred to as “guest software.”
Note that although the virtual hardware “layer” 301 will be a software abstraction of physical components, the VM's system software 312 may be the same as would be loaded into a hardware computer. The modifier “guest” is used here to indicate that the VM, although it acts as a “real” computer from the perspective of a user, is actually just computer code that is executed on the underlying “host” hardware and software platform 100, 200. Thus, for example, I/O to the virtual device 310 will actually be carried out by I/O to the hardware device 110, but in a manner transparent to the VM.
Some interface is usually required between the VM 300 and the underlying “host” hardware 100, which is responsible for actually executing VM-related instructions and transferring data to and from the actual, physical memory 104. One advantageous interface between the VM and the underlying host system is often referred to as a virtual machine monitor (VMM), also known as a virtual machine “manager.” Virtual machine monitors have a long history, dating back to mainframe computer systems in the 1960s. See, for example, Robert P. Goldberg, “Survey of Virtual Machine Research,” IEEE Computer, June 1974, pp. 34-45.
A VMM is usually a relatively thin layer of software that runs directly on top of a host, such as the system software 200, or directly on the hardware, and virtualizes the resources of the (or some) hardware platform. FIG. 1 shows a VMM 400 running directly on the system hardware 100. The VMM will typically include at least one device emulator 410, which may also form the implementation of the virtual device 310. The interface exported to the respective VM is usually such that the guest OS 320 cannot determine the presence of the VMM. The VMM also usually tracks and either forwards (to the host OS 220) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts. FIG. 1 therefore illustrates an interrupt (including fault) handler 450 within the VMM. The general features of VMMs are well known and are therefore not discussed in further detail here.
FIG. 1 illustrates a single VM 300 merely for the sake of simplicity; in many installations, there will be more than one VM installed to run on the common hardware platform; all may have essentially the same general structure, although the individual components need not be identical. Also in FIG. 1, a single VMM 400 is shown acting as the interface for the single VM 300. It would also be possible to include the VMM as part of its respective VM, that is, in each virtual system. Although the VMM is usually completely transparent to the VM, the VM and VMM may be viewed as a single module that virtualizes a computer system. The VM and VMM are shown as separate software entities in the figures for the sake of clarity. Moreover, it would also be possible to use a single VMM to act as the interface for more than one VM, although it will in many cases be more difficult to switch between the different contexts of the various VMs (for example, if different VMs use different guest operating systems) than it is simply to include a separate VMM for each VM. This invention works with all such VM/VMM configurations.
In all of these configurations, there must be some way for the VM to access hardware devices, albeit in a manner transparent to the VM itself. One solution would of course be to include in the VMM all the required drivers and functionality normally found in the host OS 220 to accomplish I/O tasks. Two disadvantages of this solution are increased VMM complexity and duplicated effort—if a new device is added, then its driver would need to be loaded into both the host OS and the VMM. A third disadvantage is that the use of a hardware device by a VMM driver may confuse the host OS, which typically would expect that only the host's driver would access the hardware device. In such systems, a better method has been implemented by VMware, Inc., in its Workstation product. This method is also illustrated in FIG. 1.
In the system illustrated in FIG. 1, both the host OS and the VMM are installed at system level, meaning that they both run at the greatest privilege level and can therefore independently modify the state of the hardware processor(s). For I/O to at least some devices, however, the VMM may issue requests via the host OS 220. To make this possible, a special driver VMdrv 242 is installed as any other driver within the host OS 220 and exposes a standard API to a user-level application VMapp 500. When the system is in the VMM context, meaning that the VMM is taking exceptions, handling interrupts, etc., but the VMM wishes to use the existing I/O facilities of the host OS, the VMM calls the driver VMdrv 242, which then issues calls to the application VMapp 500, which then carries out the I/O request by calling the appropriate routine in the host OS.
In FIG. 1, a vertical line 600 symbolizes the boundary between the virtualized (VM/VMM) and non-virtualized (host software) “worlds” or “contexts.” The driver VMdrv 242 and application VMapp 500 thus enable communication between the worlds even though the virtualized world is essentially transparent to the host system software 200.
In some cases, however, it may be beneficial to deploy VMMs on top of a thin software layer, a “kernel,” constructed specifically for this purpose. FIG. 2 illustrates an implementation in which a kernel 700 takes the place of and performs the conventional functions of the host OS, including handling actual I/O operations. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple virtual machines (for example, resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs.
As used herein, the “host” OS therefore means either the native OS 220 of the underlying physical computer, a specially constructed kernel 700 as described above, or whatever other system-level software handles actual I/O operations, takes interrupts, etc. for the VM. The invention may be used in all the different configurations described above.
Memory Mapping in a Virtual Computer System
Most modern computers implement a “virtual memory” mechanism, which allows user-level software to specify memory locations using a set of virtual addresses, which are then translated or mapped into a different set of physical addresses that are actually applied to physical memory to access the desired memory locations. The range of possible virtual addresses that may be used by user-level software constitutes a virtual address space, while the range of possible physical addresses that may be specified constitutes a physical address space. The virtual address space is typically divided into a number of virtual memory pages, each having a different virtual page number, while the physical address space is typically divided into a number of physical memory pages, each having a different physical page number. A memory “page” in either the virtual address space or the physical address space typically comprises a particular number of memory locations, such as either a four kilobyte (KB) memory page or a four megabyte (MB) memory page in an x86 computer system.
System-level software generally specifies mappings from memory pages in the virtual address space using virtual page numbers to memory pages in the physical address space using physical page numbers. The terms “virtual address” and “virtual address space” relate to the well-known concept of a virtual memory system, which should not be confused with the computer virtualization technology described elsewhere in this patent, involving other well-known concepts such as VMMs and VMs. A well-known technique of memory paging may be used to enable an application to use a virtual address space that is larger than the amount of physical memory that is available for use by the application. The code and data corresponding to some of the pages in the virtual address space may reside in physical memory, while other pages of code and data may be stored on a disk drive, for example. If the application attempts to access a memory location in the virtual address space for which the corresponding data is stored on the disk drive, instead of in physical memory, then the system software typically loads a page worth of data from the disk drive including the desired data into a page of physical memory (possibly first storing the contents of the memory page to disk). The system software then allows the attempted memory access to complete, accessing the physical memory page into which the data has just been loaded.
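By way of illustration only, the memory paging behavior described above may be sketched as follows. The class name DemandPager, its fields, and the first-in, first-out choice of which resident page to write back are assumptions made for this sketch; the description above does not specify a replacement policy.

```python
from collections import OrderedDict


class DemandPager:
    """Toy model of memory paging: a few resident pages, the rest on disk."""

    def __init__(self, num_frames, backing_store):
        self.num_frames = num_frames          # physical pages available
        self.disk = dict(backing_store)       # VPN -> contents stored on disk
        self.resident = OrderedDict()         # VPN -> contents in physical memory
        self.page_faults = 0

    def read(self, vpn):
        """Return the contents of a virtual page, loading it from disk if needed."""
        if vpn not in self.resident:
            self.page_faults += 1
            contents = self.disk.pop(vpn)     # KeyError if the page does not exist
            if len(self.resident) >= self.num_frames:
                # Assumed policy: write the oldest resident page back to disk.
                victim, victim_contents = self.resident.popitem(last=False)
                self.disk[victim] = victim_contents
            self.resident[vpn] = contents
        # The attempted access now completes against physical memory.
        return self.resident[vpn]


pager = DemandPager(num_frames=2, backing_store={0: "a", 1: "b", 2: "c"})
for vpn in (0, 1, 2, 1, 0):
    pager.read(vpn)
print(pager.page_faults)  # prints 4
```

With two physical pages and three virtual pages, the third access displaces page 0, so the final access to page 0 faults again while the repeated access to page 1 does not.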
Now suppose that the host OS 220 of FIG. 1 implements a virtual memory system, with memory paging. This discussion ignores the topic of memory segmentation for now, as that topic is covered in the next section of this patent. If an application 260 requests access to the hardware memory 104, for example, the request is issued with a virtual address, which must be mapped to a physical address that is issued to the actual hardware memory. This mapping, or translation, is typically specified by the OS 220, which includes some form of memory management module 245 for this purpose. The OS thus converts the “virtual” address (VA), in particular, the virtual page number (VPN) of the request, into a “physical” address (PA), in particular, a physical page number (PPN), that can be applied directly to the hardware. (The VA and PA have a common offset from a base address, so that only the VPN needs to be converted into a corresponding PPN.)
When accessing a given memory location specified by a virtual address, the processor breaks the virtual address into a virtual page number (higher-order address bits) plus an offset into that page (lower-order address bits). The virtual page number (VPN) is then translated using mappings established by the OS into a physical page number (PPN) based on a page table entry (PTE) for that VPN in the page table associated with the currently active address space. The page table will therefore generally include an entry for every VPN. The actual translation may be accomplished simply by replacing the VPN (the higher order bits of the virtual address) with its PPN mapping, leaving the lower order offset bits the same.
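By way of illustration only, the translation just described may be sketched as follows, assuming four kilobyte pages and a single-level page table modeled as a dictionary. The function name translate and the example page numbers are hypothetical.

```python
from typing import Dict

PAGE_SHIFT = 12                  # 4 KB pages: the low 12 bits are the offset
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def translate(virtual_address: int, page_table: Dict[int, int]) -> int:
    """Replace the VPN of a virtual address with its PPN, keeping the offset."""
    vpn = virtual_address >> PAGE_SHIFT        # higher-order address bits
    offset = virtual_address & OFFSET_MASK     # lower-order address bits
    if vpn not in page_table:
        raise LookupError(f"no page table entry for VPN {vpn:#x}")
    ppn = page_table[vpn]
    return (ppn << PAGE_SHIFT) | offset


# Hypothetical page table with one entry: VPN 0x12345 -> PPN 0xabc.
example_page_table = {0x12345: 0xABC}
print(hex(translate(0x12345678, example_page_table)))  # prints 0xabc678
```

The offset bits 0x678 pass through unchanged; only the page number is replaced.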
To speed up virtual-to-physical address translation, a hardware structure known as a translation look-aside buffer (TLB) is normally included, for example, as part of a hardware memory management unit (MMU) 108. The TLB contains, among other information, VA-to-PA mapping entries at least for VPNs that have been addressed recently or frequently. Rather than searching the entire page table, the hardware searches the TLB first. If the current VPN is not found in the TLB, then a “TLB miss” occurs, the page tables in memory are consulted to find the proper translation, and the TLB is updated to include this translation. The OS thus specifies the mapping, but the hardware MMU 108 usually actually performs the conversion of one type of page number to the other. Below, for the sake of simplicity, when it is stated that a software module “maps” page numbers, the existence and operation of a hardware device such as the MMU 108 may be assumed.
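By way of illustration only, the TLB lookup order described above may be sketched in software as follows. The class name Tlb, its capacity, and the least-recently-used replacement of entries are assumptions made for this sketch; actual TLBs are hardware structures whose organization and replacement policy vary by processor.

```python
from collections import OrderedDict


class Tlb:
    """Toy TLB: a small cache of VPN -> PPN mappings consulted before the page table."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # VPN -> PPN, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as recently used
            return self.entries[vpn]
        # TLB miss: consult the page table, then update the TLB.
        self.misses += 1
        ppn = page_table[vpn]                  # KeyError if no mapping exists
        self.entries[vpn] = ppn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # assumed policy: evict the LRU entry
        return ppn


page_table = {1: 10, 2: 20, 3: 30}             # hypothetical VPN -> PPN mappings
tlb = Tlb(capacity=2)
for vpn in (1, 1, 2, 3):
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)  # prints 1 3
```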
The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as “Computer Organization and Design: The Hardware/Software Interface,” by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (chapter 7.4 “Virtual Memory”). Patterson and Hennessy analogize address translation to finding a book in a library. The VPN is the “title” of the book and the full card catalog is the page table. A catalog card is included for every book in the library and tells the searcher where the book can be found. The TLB is then the “scratch” paper on which the searcher writes down the locations of the specific books he has previously looked up.
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 360 in the VM 300 is remapped twice in order to determine which page of the hardware memory is intended. A mapping module 345 within the guest OS 320 translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner. The guest OS therefore “believes” that it is directly addressing the actual hardware memory, but in fact it is not. Ultimately, however, a valid address to the actual hardware memory must be used.
An address mapping module 445 in the VMM 400 therefore takes the GPPN issued by the guest OS 320 and maps it to a hardware page number PPN that can be used to address the hardware memory. From the perspective of the guest OS, the GVPN and GPPN are virtual and physical page numbers just as they would be if the guest OS were the only OS in the system. From the perspective of the actual host OS, however, the GPPN is a page number in the virtual address space, that is, a VPN, which is then mapped into the physical memory space of the hardware memory as a PPN. Note that in some literature involving virtualized systems, GVPNs, GPPNs, VPNs and PPNs are sometimes referred to as “VPNs,” “PPNs,” “VPNs” and “MPNs,” respectively, where “MPN” means “machine page number,” that is, the page number used to address the hardware memory. The problem is, though, that “VPN” is then used to mean the virtual page number in both the guest and host contexts, and one must always be aware of the current context to avoid confusion. Regardless of notation, however, the intermediate GPPN→PPN mapping performed by the VMM is transparent to the guest system, and the host OS need not maintain a GVPN→GPPN mapping.
These address mappings are illustrated in FIG. 3. The guest OS 320 generates a guest OS page table 313 that maps the guest software virtual address space to what the guest OS perceives to be the physical address space. In other words, the guest OS 320 maps GVPNs to GPPNs. Suppose, for example, that a guest application 360 attempts to access a memory location having a first GVPN, and that the guest OS has specified in the guest OS page table that the first GVPN is backed by what it believes to be a physical memory page having a first GPPN. The mapping from the first GVPN to the first GPPN is used by the virtual system hardware 301, and it is loaded into a virtual TLB (VTLB) 330.
The address mapping module 445 within the VMM 400 keeps track of mappings between the GPPNs of the guest OS 320 and the “real” physical memory pages of the physical memory 104 (see FIG. 1) within the system hardware 100. Thus, the address mapping module 445 maps GPPNs from the guest OS 320 to corresponding PPNs in the physical memory. Continuing the above example, the address mapping module translates the first GPPN into a corresponding PPN, let's say a first PPN.
The address mapping module 445 creates a shadow page table 413 that is used by the MMU 108 (see FIG. 1) within the system hardware 100. The shadow page table 413 includes a number of shadow PTEs that generally correspond to the PTEs in the guest OS page table 313, but the shadow PTEs map guest software virtual addresses to corresponding physical addresses in the actual physical memory 104, instead of to the physical addresses specified by the guest OS 320. In other words, while the guest OS page table 313 provides mappings from GVPNs to GPPNs, the shadow PTEs in the shadow page table 413 provide mappings from GVPNs to corresponding PPNs. Thus, continuing the above example, instead of containing a mapping from the first GVPN to the first GPPN, the shadow page table 413 may contain a shadow PTE that maps the first GVPN to the first PPN. Thus, when the guest application attempts to access a memory location having the first GVPN, the MMU 108 uses the mapping from the first GVPN to the first PPN in the shadow page table to access the corresponding memory location in the physical memory page having the first PPN. The MMU also loads the mapping from the first GVPN to the first PPN into a physical TLB 130 in the system hardware 100, if the mapping is not already in the TLB.
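By way of illustration only, the relationship between the guest OS page table, the GPPN-to-PPN mappings and the shadow page table may be sketched as the composition of two mappings. The function name build_shadow_page_table and the example page numbers are hypothetical, and the handling of a GPPN that has no PPN assigned (its shadow entry is simply omitted here) is an assumption of this sketch.

```python
def build_shadow_page_table(guest_page_table, gppn_to_ppn):
    """Compose GVPN -> GPPN (guest OS) with GPPN -> PPN (VMM) into GVPN -> PPN."""
    shadow_page_table = {}
    for gvpn, gppn in guest_page_table.items():
        if gppn in gppn_to_ppn:
            shadow_page_table[gvpn] = gppn_to_ppn[gppn]
        # Assumption: a GPPN with no PPN assigned yields no shadow entry.
    return shadow_page_table


# Hypothetical mappings: the guest OS maps a first GVPN to a first GPPN,
# and the address mapping module maps that GPPN to a first PPN.
guest_os_page_table = {0x10: 0x2, 0x11: 0x3}
vmm_mappings = {0x2: 0x7F}

shadow = build_shadow_page_table(guest_os_page_table, vmm_mappings)
print(shadow)  # prints {16: 127}
```

The hardware would then use only the composed GVPN-to-PPN entries, so the intermediate GPPN never appears in a translation performed on behalf of the guest software.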
Segmented Memory
The best-selling virtualization products of VMware are designed for execution on a processor having the x86 architecture. Some of these VMware products based on the x86 architecture are used as specific examples for describing implementations of this invention. As a result, much of this description uses terminology and conventions of the x86 architecture. In particular, the privilege levels used in the x86 architecture are used throughout this description as a specific example of all such protection mechanisms. Thus, a privilege level of zero is used to indicate a most-privileged level, a privilege level of three is used to indicate a least-privileged level, with privilege levels of one and two indicating intermediate privilege levels, accordingly. Also, a privilege level of three is considered a user privilege level, while a privilege level of zero, one or two is considered a supervisor privilege level. The use of a single protection mechanism having a specific set of privilege levels as an example provides a simpler, more consistent description of the invention. However, the invention is not limited to implementations using the x86 architecture or implementations using similar protection mechanisms. The x86 architecture is described in numerous books and other references, including the IA-32 Intel Architecture Software Developer's Manual (the “IA-32 Manual”) from Intel Corporation. One aspect of the x86 architecture that is relevant to this invention is its implementation of memory segmentation. The invention also applies to other architectures that implement segmented memory, however.
The segmented memory implementation of the x86 architecture is illustrated in FIG. 4. As described in detail in the IA-32 Manual, a Global Descriptor Table Register (GDTR) 900 specifies a base address and a limit for a Global Descriptor Table (GDT) 908. The GDT begins in memory at the base address specified in the GDTR, which is illustrated in FIG. 4 by a line marked with a “B” (for base) extending between the GDTR 900 and the GDT 908. The GDT extends in memory to an address that is equal to the sum of the base address specified in the GDTR and the limit that is also specified in the GDTR. The upper limit of the GDT is illustrated in FIG. 4 by a line marked with a “B+L” (for base+limit) also extending between the GDTR 900 and the GDT 908. Corresponding lines, in FIG. 4 and in other drawings in this patent, show the extent of other data structures in memory, as defined by other base addresses and other limits, although these other lines are not marked with the labels “B” and “B+L,” respectively, in the other drawings for simplicity.
The GDT contains a number of segment descriptors, such as a first data descriptor 910, a second data descriptor 912 and a code descriptor 914. Each of the segment descriptors specifies a base address, a limit, protection characteristics and other attributes for a memory segment within a four gigabyte (GB) linear address space 916. Thus, for example, the first data descriptor 910 defines a stack segment 918 by specifying a first base address and a first limit, the second data descriptor 912 defines a data segment 920 by specifying a second base address and a second limit, and the code descriptor 914 defines a code segment 922 by specifying a third base address and a third limit.
The base addresses and the limits specified by the segment descriptors define the corresponding memory ranges included in the corresponding memory segments in the same manner as the base address and the limit specified by the GDTR 900 define the range of memory locations occupied by the GDT 908. The beginning address of the stack segment 918 is illustrated in FIG. 4 by a line marked with a “B” extending between the first data descriptor 910 and the stack segment 918, while the ending address of the stack segment is illustrated by a line marked with a “B+L” extending between the first data descriptor and the stack segment. Similarly, the beginning address of the data segment 920 is illustrated in FIG. 4 by a line marked with a “B” extending between the second data descriptor 912 and the data segment 920, while the ending address of the data segment is illustrated by a line marked with a “B+L” extending between the second data descriptor and the data segment. Also, the beginning address of the code segment 922 is illustrated in FIG. 4 by a line marked with a “B” extending between the code descriptor 914 and the code segment 922, while the ending address of the code segment is illustrated by a line marked with a “B+L” extending between the code descriptor and the code segment. Corresponding lines are used in other drawings in this patent to illustrate beginning and ending addresses for other memory segments, although the lines in the other drawings are not marked with the labels “B” and “B+L,” respectively, for simplicity.
The x86 architecture also includes a Local Descriptor Table Register (LDTR) that specifies a base address and a limit for a Local Descriptor Table (LDT). The LDTR and LDT are similar to the GDTR and the GDT and are described in detail in the IA-32 Manual. The description in this patent is restricted to using the GDTR and the GDT for simplicity, although it applies equally well to the use of the LDTR and the LDT.
The x86 architecture includes six segment registers that provide contemporaneous access to up to six memory segments. FIG. 4 shows a Stack Segment (SS) register 902, a Data Segment (DS) register 904 and a Code Segment (CS) register 906. The x86 architecture also includes ES, FS and GS data segment registers, which are not shown in FIG. 4 for simplicity. A segment selector is loaded into a segment register to provide access to a memory segment. The segment selector includes an index value, a table indicator and a Requested Privilege Level (RPL). The table indicator indicates whether the index value is applied to the GDT or to the LDT, and the index value selects a segment descriptor from the indicated descriptor table. For this description, the table indicator is assumed to indicate the GDT.
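By way of illustration only, the three fields of a segment selector may be extracted as follows. The bit positions follow the x86 selector layout (RPL in bits 0 and 1, table indicator in bit 2, index in bits 3 through 15); the names SegmentSelector and decode_selector are hypothetical.

```python
from typing import NamedTuple


class SegmentSelector(NamedTuple):
    index: int   # selects a segment descriptor within the indicated table
    table: str   # "GDT" or "LDT", from the table indicator bit
    rpl: int     # Requested Privilege Level, 0 through 3


def decode_selector(selector: int) -> SegmentSelector:
    """Split a 16-bit segment selector into index, table indicator and RPL."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("a segment selector is a 16-bit value")
    return SegmentSelector(
        index=selector >> 3,
        table="LDT" if selector & 0x4 else "GDT",
        rpl=selector & 0x3,
    )


print(decode_selector(0x23))
# prints SegmentSelector(index=4, table='GDT', rpl=3)
```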
Each segment register includes a software-visible part that contains a segment selector and a hidden part that contains a segment descriptor. When a segment selector is loaded into the visible part of a segment register, the processor also loads the hidden part of the segment register with the base address, segment limit and access control information from the segment descriptor pointed to by the segment selector. After a segment register is loaded with a segment selector, the segment register contains all the information necessary to reference the selected memory segment.
To access a memory location within a memory segment, a segment register is first loaded with a segment selector, which points to a segment descriptor in a descriptor table, the segment descriptor defining the memory segment. Then, for the actual memory reference, the segment register is selected either implicitly or explicitly, and an offset into the memory segment is specified. The segment selector combined with the offset into the memory segment is referred to as a logical address in the IA-32 Manual. The sum of the base address of the memory segment and the offset into the memory segment gives a linear address in the linear address space 916. If memory paging is disabled, the linear address is also used as a physical address in a physical address space 926. Thus, with paging disabled, the linear address is applied directly to the memory 104 to perform a memory access.
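By way of illustration only, the formation of a linear address from a segment base address and an offset may be sketched as follows. This sketch is a simplification: it models only the base address and the limit, and it assumes a byte-granular, expand-up segment in a 32-bit linear address space, ignoring the protection characteristics and other attributes of a segment descriptor. The names SegmentDescriptor and linear_address are hypothetical.

```python
from typing import NamedTuple


class SegmentDescriptor(NamedTuple):
    base: int    # beginning address of the segment in the linear address space
    limit: int   # the segment extends from base to base + limit


def linear_address(descriptor: SegmentDescriptor, offset: int) -> int:
    """Return base + offset, rejecting offsets beyond the segment limit."""
    if not 0 <= offset <= descriptor.limit:
        raise ValueError("offset lies outside the memory segment")
    return (descriptor.base + offset) & 0xFFFFFFFF   # 4 GB linear address space


# Hypothetical data segment beginning at 0x00400000 with a limit of 0xFFFF.
data_segment = SegmentDescriptor(base=0x00400000, limit=0xFFFF)
print(hex(linear_address(data_segment, 0x1234)))  # prints 0x401234
```

With paging disabled, the value returned would be used directly as the physical address; with paging enabled, it would be treated as the address to be translated through the page tables.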
If memory paging is enabled, then the linear address is mapped to a corresponding physical address in the physical address space 926 using a set of one or more page tables 924. The process of mapping a linear address to a physical address using the page tables 924 is substantially the same as described above, in the previous section of this patent, with the linear address being treated as a “virtual address” for purposes of that description. Thus, the page tables 924 contain PTEs that provide mappings from linear addresses to corresponding physical addresses, or, more specifically, from linear page numbers (LPNs) to corresponding physical page numbers (PPNs). The resulting physical address in the physical address space 926 is then applied to the memory 104 to perform the memory access.
The “linear address” terminology used in this description of segmented memory may be applied to the previous description of memory paging in a virtual computer system. Thus, referring again to FIG. 3, the guest OS 320 generates a guest OS page table 313 that maps the guest software linear address space 916 (FIG. 4) to what the guest OS perceives to be the physical address space. In other words, the guest OS 320 maps guest linear page numbers (GLPNs) to GPPNs. These mappings from GLPNs to GPPNs are also selectively loaded into the virtual TLB 330. The address mapping module 445 maps GPPNs from the guest OS 320 to corresponding PPNs in the physical memory. The address mapping module 445 creates a shadow page table 413 that is used by the MMU 108 (see FIG. 1) within the system hardware 100. The shadow page table 413 includes a number of shadow PTEs that generally correspond to the PTEs in the guest OS page table 313, but the shadow PTEs map guest software linear addresses to corresponding physical addresses in the actual physical memory 104, instead of to the physical addresses specified by the guest OS 320. In other words, while the guest OS page table 313 provides mappings from GLPNs to GPPNs, the shadow PTEs in the shadow page table 413 provide mappings from GLPNs to corresponding PPNs. These mappings from GLPNs to PPNs are also selectively loaded into the physical TLB 130 in the system hardware 100.
Performance of a Virtual Computer System
Speed is a critical issue in virtualization: a VM that perfectly emulates the functions of a given computer but that is too slow to perform needed tasks is obviously of little use to a user. Ideally, a VM should operate at the native speed of the underlying host system. In practice, even where only a single VM is installed on the host, it is impossible to run a VM at native speed, if for no other reason than that the instructions that define the VMM must also be executed. Near-native speed is possible, however, in many common applications.
The highest speed for a VM is found in the special case where every VM instruction executes directly on the hardware processor. This would in general not be a good idea, however, because the VM should not be allowed to operate at the greatest privilege level; otherwise, it might alter the instructions or data of the host OS or the VMM itself and cause unpredictable behavior. Moreover, in cross-architectural systems, one or more instructions issued by the VM may not be included in the instruction set of the host processor. Instructions that cannot (or must not) execute directly on the host are typically converted into an instruction stream that can. This conversion process is commonly known as “binary translation.”
U.S. Pat. No. 6,397,242 (Devine, et al., “Virtualization System Including a Virtual Machine Monitor for a Computer with a Segmented Architecture”, “the '242 patent”), which is incorporated herein by reference, describes a system in which the VMM includes a mechanism that allows VM instructions to execute directly on the hardware platform whenever possible, but that switches to binary translation when necessary. This allows for the speed of direct execution combined with the security of binary translation.
Accordingly, FIG. 1 shows a Direct Execution (DE) unit 460 and a Binary Translation (BT) unit 462. In the best-selling virtualization products of VMware, guest software that operates at user-level in the VM 300 (code that executes at a Current Privilege Level (CPL) of 3 in the x86 architecture) is generally executed directly on the system hardware 100 using the DE unit 460, while guest software that operates at a more-privileged level in the VM (privileged code executing at a CPL of 0, 1 or 2) is generally handled by the BT unit 462. However, as described below, in some circumstances, some guest software that executes at user-level in the VM 300 is handled by the BT unit 462, instead of the DE unit 460.
As described generally in the '242 patent, the direct execution of guest instructions involves setting up certain safeguards, such as memory traces and shadow descriptor tables, and then allowing guest instructions to execute directly on the system hardware 100. Under various circumstances, such as when the guest software issues a system call or when a memory trace is triggered, direct execution of guest instructions is suspended and control passes to the VMM 400. The VMM may emulate the execution of one or more guest instructions, such as through interpretation. Then, depending on the circumstances, the VMM may resume the direct execution of guest instructions, or it may switch over to binary translation, using the BT unit 462.
For binary translation, the BT unit 462 creates and maintains a translation cache within the memory of the VMM 400 that contains code translations for different sets of one or more guest instructions. When binary translation is to be used for a specific set of one or more guest instructions, the BT unit 462 first checks the translation cache for a translation that corresponds to the specific set of one or more guest instructions. If a corresponding translation cannot be found in the cache, then the BT unit 462 generates one. In either case, a corresponding code translation is ultimately executed by the BT unit. After executing one translation, the BT unit may jump to another translation, it may find another translation that corresponds to the next guest instruction(s) to be executed, or it may generate a new translation corresponding to the next guest instruction(s). In this manner, the BT unit 462 may execute multiple translations during a single pass of binary translation.
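The check-then-generate behavior of the translation cache may be sketched as follows. This Python sketch is illustrative only and does not represent the actual BT unit 462. It assumes that translations are keyed by the guest address of the first translated instruction, and the class name TranslationCache and the translate callback are hypothetical.

```python
class TranslationCache:
    """Toy model of a translation cache keyed by guest instruction address."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical translator callback
        self._cache = {}
        self.misses = 0

    def lookup(self, guest_pc):
        # First check the cache for an existing translation.
        if guest_pc not in self._cache:
            # No translation found: generate one and store it.
            self.misses += 1
            self._cache[guest_pc] = self._translate(guest_pc)
        # In either case, a translation is returned for execution.
        return self._cache[guest_pc]


tc = TranslationCache(lambda pc: "translation@%#x" % pc)
first = tc.lookup(0x1000)    # miss: a translation is generated
second = tc.lookup(0x1000)   # hit: the stored translation is reused
```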
At some point, however, the VMM 400 will stop executing translated instructions and return to the direct execution of guest instructions, using the DE unit 460, such as when the guest software in the VM 300 returns to the user-level. Thus, the VMM 400 switches back and forth between using the DE unit 460 to directly execute guest instructions and using the BT unit 462 to execute translations of guest instructions. Direct execution is generally used whenever possible for improved performance, but binary translation is used when necessary.
Another technique that is used in existing VMware products to improve performance is to have the VMM 400 share the linear address space of the guest software, including the guest OS 320 and one or more guest applications 360. The VMM 400 continuously shares the linear address space of whichever software is currently executing in the VM 300. During binary translation, memory accesses are made to the memory of both the guest software and the VMM 400. When generating translations, for example, the BT unit 462 accesses guest memory to read the instructions that are to be translated, and it accesses VMM memory to store the translations in the translation cache. More importantly, when executing instructions from the translation cache, accesses are typically also made to data in the guest memory, in addition to data and the instructions from the VMM memory. If the VMM 400 were to maintain a separate address space from the guest software, a change in address spaces would be required each time the VMM 400 switched between accessing guest data and VMM data. As is well known, switching address spaces generally takes a considerable amount of time with the x86 architecture, as well as with other architectures. As a result, the continual switching of address spaces that would be required in binary translation if the VMM were to use a separate address space would dramatically slow down the operation of binary translation.
In addition, the emulation of guest instructions by the VMM 400, such as through interpretation, generally also requires access to the memory of both the VMM and the guest software. Accordingly, if separate address spaces were maintained, transitions from the direct execution of guest instructions to the emulation of guest instructions by the VMM would also be substantially slowed.
As described above, however, the VMM 400 is preferably transparent to the VM software, including the guest software. So the VMM preferably shares the address space of the guest software, without the knowledge of the guest software, and yet the VMM memory must be protected from the guest software. In the virtualization products of VMware described above, the memory segmentation mechanism is used to protect the VMM memory from guest software.
Protection of VMM using Memory Segments
The protection mechanism used in the VMware products described above is illustrated in FIG. 5A. As described above, the virtual system hardware 301 is a virtualization of a complete computer system. In particular, the virtual system hardware includes a VCPU 302, which is a virtualization of a complete, physical processor. In these VMware products, the VCPU 302 also has the x86 architecture. Thus, the VCPU 302 includes a virtual GDTR (V-GDTR) 900V, a virtual CS register (V-CS) 906V and a virtual DS register (V-DS) 904V, as illustrated in FIG. 5A. These virtual registers function in substantially the same manner as the respective physical registers described above, namely the GDTR 900, the CS register 906 and the DS register 904, which are also illustrated in FIG. 5A.
The guest OS 320 creates a Global Descriptor Table in a conventional manner, which is referred to as a guest Global Descriptor Table (G-GDT) 908G. The guest OS 320 then fills the guest GDT 908G with segment descriptors in a conventional manner, such as a guest code descriptor 914G and a guest data descriptor 912G. As described above, each of the segment descriptors defines a memory segment by specifying a base address and a limit for the memory segment, along with other segment properties. Thus, for example, the guest code descriptor 914G defines a guest code segment 922G within a guest linear address space 916V and the guest data descriptor 912G defines a guest data segment 920G in the same address space 916V. The beginning and ending addresses of the guest code segment and the guest data segment, defined by the respective base addresses and limits, are indicated in FIG. 5A using dashed lines extending between the respective descriptors and memory segments. Thus, the guest code segment 922G is made up of a first code segment portion 922V and a second code segment portion 922W, while the guest data segment 920G is made up of a first data segment portion 920V, a second data segment portion 920W and a third data segment portion 920X.
The guest OS 320 also activates the guest GDT 908G within the VM 300 by loading the virtual GDTR 900V with a base address and a limit that correspond to the guest GDT 908G, as illustrated in FIG. 5A by the two lines extending between the virtual GDTR and the guest GDT. The guest OS 320 may also load segment selectors into the segment registers of the VM 300 to activate the corresponding memory segments. For example, as illustrated in FIG. 5A, the guest OS 320 may load a segment selector for the guest code descriptor 914G into the virtual CS 906V to select the guest code segment 922G for instruction fetches, and the guest OS 320 may load a segment selector for the guest data descriptor 912G into the virtual DS 904V to select the guest data segment 920G for data accesses. Of course, the guest OS 320 may also load additional segment descriptors into the guest GDT 908G to define additional memory segments and select additional memory segments for use by loading appropriate segment selectors into the other segment registers.
As described in the '242 patent, however, the system hardware 100 does not access memory segments based on the guest GDT 908G. Instead, the VMM 400 creates a separate, shadow Global Descriptor Table (S-GDT) 908S, as illustrated in FIG. 5A, and loads the hardware GDTR 900 with a base address and limit that correspond to the shadow GDT 908S. Thus, the system hardware 100 accesses memory segments based on the shadow GDT 908S, instead.
As also described in the '242 patent, the VMM 400 loads the shadow GDT 908S with “cached descriptors,” “VMM descriptors” and “shadow descriptors.” The cached descriptors correspond with the segment descriptors that are loaded into the segment registers of the VM 300 to emulate the segment-caching properties of the x86 architecture. The VMM descriptors are for use by the VMM 400 to access its own memory.
The shadow descriptors, on the other hand, are derived from the guest segment descriptors in the guest GDT 908G. Thus, for example, the shadow GDT 908S may contain a shadow code descriptor 914T that is derived from the guest code descriptor 914G and a shadow data descriptor 912T that is derived from the guest data descriptor 912G. The VMM 400 also puts a memory write trace on the guest GDT 908G, so that the VMM 400 can intercept any guest instruction that attempts to modify a guest segment descriptor in the guest GDT. The VMM 400 can then modify both the guest segment descriptor in the guest GDT and a corresponding shadow descriptor in the shadow GDT in accordance with the guest instruction.
The VMM 400 may also load the physical segment registers with segment selectors to select corresponding memory segments for use. The guest software may also load segment selectors into the physical segment registers, with certain limitations, as described in greater detail below, which will select corresponding memory segments as defined by segment descriptors in the shadow GDT 908S. For example, the CS register 906 may be loaded with a segment selector for the shadow code descriptor 914T, and the DS register 904 may be loaded with a segment selector for the shadow data descriptor 912T, as illustrated in FIG. 5A.
As described in the '242 patent, each of the guest segment descriptors in the guest GDT 908G is generally copied into a corresponding shadow segment descriptor in the shadow GDT 908S, but with a few possible modifications. For example, in generating shadow descriptors from corresponding guest descriptors, the VMM 400 may change the Descriptor Privilege Level (DPL) of some of the descriptors. In particular, if a guest descriptor has a DPL of 0, the VMM of the described embodiment sets the DPL of the corresponding shadow descriptor to 1, so that the shadow descriptor may be loaded into a segment register when binary translation is run at a CPL of 1. The VMM 400 may also disable callgates. Another possible modification involves truncating the memory segment defined by the guest OS 320 to protect the VMM memory.
FIG. 5A shows a VMM memory 930 occupying the upper-most portion of the linear address space 916V of the guest software. In the virtualization products of VMware described above, the VMM memory occupies the top four MB of the four GB linear address space of the guest software. As defined by the guest OS 320 in the guest code descriptor 914G, the guest code segment 922G extends from the bottom of the first code segment portion 922V to the top of the second code segment portion 922W, while the guest data segment 920G extends from the bottom of the first data segment portion 920V, through the second data segment portion 920W, to the top of the third data segment portion 920X. Thus, both the second code segment portion 922W and the second data segment portion 920W coincide with the VMM memory 930 in the linear address space 916V.
If the guest software were allowed to access the linear address space corresponding to the second code segment portion 922W and the second data segment portion 920W, the VMM memory 930 could become corrupted. The VMM 400 cannot allow this to happen. In deriving the shadow code descriptor 914T from the guest code descriptor 914G, the VMM 400 copies most of the data from the guest code descriptor, including the base address for the memory segment 922G, into the shadow code descriptor. However, instead of simply copying the limit from the guest code descriptor 914G, the VMM 400 sets the limit in the shadow code descriptor 914T to a value that indicates the top of the first code segment portion 922V, as illustrated in FIG. 5A. Thus, while the guest code segment 922G includes the two code segment portions 922V and 922W, the code segment defined by the shadow code descriptor 914T, which is actually used by the system hardware 100, includes only the first code segment portion 922V. Similarly, the VMM 400 copies the base address and other data from the guest data descriptor 912G into the shadow data descriptor 912T, but sets the limit in the shadow data descriptor to a value that indicates the top of the first data segment portion 920V, as also illustrated in FIG. 5A. Thus, the VMM 400 truncates the guest code segment 922G at the top of the first code segment portion 922V to create a truncated code segment 922T and it truncates the guest data segment 920G at the top of the first data segment portion 920V to create a truncated data segment 920T. If a guest memory segment does not extend into the region of the linear address space 916V that is occupied by the VMM memory 930, however, then the memory segment need not be truncated when generating a corresponding shadow segment descriptor.
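The derivation of a shadow descriptor from a guest descriptor, including the truncation of the limit and the DPL adjustment noted above, may be sketched in Python as follows. The sketch is illustrative only. The Descriptor class, the constant names and the function derive_shadow are hypothetical; the limit is treated as the highest valid byte offset, ignoring the granularity bit; and the base address of the guest segment is assumed to lie below the VMM memory 930.

```python
from dataclasses import dataclass, replace

ADDRESS_SPACE = 1 << 32                  # 4 GB linear address space
VMM_BASE = ADDRESS_SPACE - (4 << 20)     # VMM occupies the top 4 MB


@dataclass(frozen=True)
class Descriptor:
    base: int    # linear address of the first byte of the segment
    limit: int   # highest valid offset within the segment (inclusive)
    dpl: int     # Descriptor Privilege Level


def derive_shadow(guest):
    """Copy a guest descriptor, truncating it below the VMM memory if needed."""
    if guest.base >= VMM_BASE:
        raise ValueError("base inside the VMM region is not handled in this sketch")
    # Highest offset that still maps to a linear address below VMM_BASE.
    max_safe_limit = VMM_BASE - 1 - guest.base
    limit = min(guest.limit, max_safe_limit)
    # A guest DPL of 0 becomes 1, as described for the shadow descriptors.
    dpl = 1 if guest.dpl == 0 else guest.dpl
    return replace(guest, limit=limit, dpl=dpl)


flat = derive_shadow(Descriptor(base=0, limit=0xFFFFFFFF, dpl=3))
small = derive_shadow(Descriptor(base=0x1000, limit=0xFFF, dpl=0))
```

Here the flat segment is truncated to a limit of 0xFFBFFFFF, while the small segment, which never reaches the VMM memory, keeps its original limit.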
The VMM 400 sets the Descriptor Privilege Level (DPL) of all cached descriptors and all VMM descriptors to a privileged level, such as a DPL of 1 in the x86 architecture. As described above, direct execution is used only for user-level code, which cannot load a segment descriptor that has a DPL of 0, 1 or 2. Thus, during direct execution, guest software cannot load any cached descriptors or VMM descriptors. The only segment descriptors that can be loaded during direct execution are shadow descriptors that have a DPL of 3.
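The privilege check relied upon here may be illustrated with a short Python sketch. It models only the x86 rule for loading a data segment register (DS, ES, FS or GS), under which the load succeeds only if the DPL of the descriptor is numerically greater than or equal to both the CPL and the Requested Privilege Level (RPL) of the selector. The rules for the CS and SS registers differ and are not modeled. The function name is hypothetical, and the RPL is assumed to equal the CPL unless given.

```python
def can_load_data_segment(cpl, dpl, rpl=None):
    """Return True if a data segment register load passes the privilege check."""
    if rpl is None:
        rpl = cpl   # assumption for this sketch: selector RPL matches the CPL
    return dpl >= max(cpl, rpl)


# User-level code (CPL 3) cannot load cached or VMM descriptors (DPL 1) ...
user_loads_vmm = can_load_data_segment(cpl=3, dpl=1)
# ... but can load a shadow descriptor that has a DPL of 3.
user_loads_shadow = can_load_data_segment(cpl=3, dpl=3)
# Code running at a CPL of 1 can load either kind of descriptor.
bt_loads_vmm = can_load_data_segment(cpl=1, dpl=1)
```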
All shadow descriptors are truncated, if necessary, to protect the VMM memory 930. Therefore, during direct execution, the guest software cannot load a segment descriptor that includes any of the linear address space that is occupied by the VMM memory 930. Also, any segment registers that contain VMM descriptors are loaded with appropriate shadow descriptors before the VMM transfers control to direct execution, so that guest software has no access to any VMM descriptors during direct execution. Thus, the user-level guest software may be safely executed directly on the system hardware, and it may be allowed to load segment descriptors from the shadow GDT 908S, without putting the VMM memory 930 at risk.
Referring again to FIG. 5A, suppose that the guest software is being directly executed on the system hardware 100 and the guest software attempts to use the DS register 904 to access a memory location within the second data segment portion 920W. In this case, because the memory location is not within the truncated data segment 920T defined by the shadow descriptor 912T, a general protection fault occurs, which transfers control to the VMM 400. The VMM 400 then emulates the guest instruction that attempted to access the second data segment portion 920W, accessing the appropriate guest memory location, instead of allowing access to a location within the VMM memory 930. After emulating the guest instruction, the VMM 400 may resume the direct execution of guest instructions. As long as the VMM 400 emulates the instructions correctly, the guest software will not be able to determine that it does not have direct access to the entire linear address space 916V.
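The fault-and-emulate sequence just described may be sketched as follows. This Python sketch is illustrative only. The exception class and function names are hypothetical, the hardware limit check is reduced to a single comparison, and the emulation step is represented merely by returning the linear address in the guest's view that the VMM would have to service.

```python
ADDRESS_SPACE = 1 << 32                  # 4 GB linear address space
VMM_BASE = ADDRESS_SPACE - (4 << 20)     # VMM occupies the top 4 MB


class GeneralProtectionFault(Exception):
    """Stands in for the hardware fault raised on a segment limit violation."""


def direct_access(base, shadow_limit, offset):
    # Hardware limit check against the truncated (shadow) segment.
    if offset > shadow_limit:
        raise GeneralProtectionFault(hex(offset))
    return (base + offset) % ADDRESS_SPACE


def run_guest_access(base, guest_limit, shadow_limit, offset):
    """Return (path, linear_address) for one guest data access."""
    try:
        return "direct", direct_access(base, shadow_limit, offset)
    except GeneralProtectionFault:
        if offset > guest_limit:
            raise   # outside the guest's own segment: a genuine guest fault
        # Hypothetical VMM step: emulate the access against guest memory.
        return "emulated", (base + offset) % ADDRESS_SPACE


below = run_guest_access(0, 0xFFFFFFFF, 0xFFBFFFFF, 0x1000)
inside = run_guest_access(0, 0xFFFFFFFF, 0xFFBFFFFF, 0xFFD00000)
```

An offset that exceeds even the guest's own limit is re-raised in this sketch, on the assumption that such a fault belongs to the guest rather than to the truncation.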
As described above, during binary translation the BT unit 462 accesses both VMM memory and guest memory. In particular, some instructions in the translations in the translation cache will access VMM memory, while other instructions in the translations attempt to access guest memory. Memory accesses that are intended for VMM memory will be referred to as VMM accesses, while attempted memory accesses that are intended for guest memory are referred to as guest accesses. Although the instructions in the translations in the translation cache are generated by the BT unit 462, the specification of addresses for guest accesses is dependent on guest data. The BT unit 462 does not pre-screen the addresses that are generated for these guest accesses. Therefore, when executing instructions from the translation cache, guest accesses may be directed to the region of the linear address space 916V that is occupied by the VMM memory 930. For example, an instruction from the translation cache may cause an attempted memory access to a memory location within the second data segment portion 920W. Again, the VMM 400 must not allow such guest accesses to reach the VMM memory. At the same time, however, VMM accesses must be allowed to reach the VMM memory.
In earlier VMware products based on the x86 architecture, the BT unit 462 always executes as privileged code, at a CPL of 1. For now, for simplicity, this description assumes that the BT unit 462 executes only at a CPL of 1. As described below, however, in more recent VMware products the BT unit 462 sometimes also executes at a CPL of 3. When the BT unit 462 executes at a CPL of 1, the BT unit can generally load a segment register with a shadow descriptor, which allows the BT unit to access guest memory, or with a VMM descriptor, which allows the BT unit to access VMM memory. In the VMware products described above, the BT unit loads some of the segment registers with VMM descriptors to provide access to the VMM memory 930, and it loads one or more other segment registers with shadow descriptors to provide contemporaneous access to the guest memory. The BT unit (and more generally the VMM 400) uses cached descriptors to virtualize the segment-caching properties of the x86 architecture. The following descriptions are limited to shadow descriptors for simplicity, although they generally also apply to cached descriptors. When the BT unit 462 generates a translation for a set of one or more guest instructions, instructions that require VMM accesses use the segment registers containing VMM descriptors, while instructions that require guest accesses use the segment registers containing shadow descriptors. For example, the GS register may be loaded with a VMM descriptor and the DS register may be loaded with a shadow descriptor. Then, for an instruction that requires a VMM access, the BT unit may explicitly reference the GS register using a segment override prefix, while for an instruction that requires a guest access, the BT unit may implicitly reference the DS register. Thus, VMM accesses use memory segments that include the VMM memory 930, while guest accesses use memory segments that are truncated, if necessary, to exclude the VMM memory. 
Again, if a guest access references a linear address that is within the guest memory segment, but which is not within the truncated memory segment, a general protection fault arises and the VMM 400 gains control and emulates the guest instruction. The VMM may then return to binary translation.
Using memory segmentation to protect the VMM memory 930 as described above allows the VMM 400 to safely share the linear address space 916V of the guest software, without the guest software knowing that the address space is being shared. The VMM 400 is able to access the entire linear address space 916V, including both guest memory and VMM memory, while the guest software is prevented from accessing the VMM memory 930.
Responding to general protection faults that are caused by the truncation of guest memory segments and emulating the instructions that give rise to the faults slows down the operation of the virtual computer system, in comparison to a comparable physical computer system that does not require segment truncation. However, as long as the region of the linear address space that is occupied by the VMM memory is not used very often by the guest software, the performance gains of sharing the linear address space of the guest software far outweigh the costs of handling the faults. When the VMware products described above were developed, the most important OSs for the x86 architecture did not make much use of the upper-most 4 MB of their linear address spaces. So placing the VMM memory in this region of the address space and using the segmented memory protection mechanism described above was seen as an efficient and effective method for allowing the VMM 400 to safely and transparently share the linear address space of the guest software.
However, the protection mechanism described above is not completely efficient in all circumstances. This can be seen by referring to FIG. 5A, and comparing the guest memory segments 922G and 920G along with the corresponding truncated memory segments 922T and 920T. First, comparing the guest code segment 922G with the truncated code segment 922T shows that the second code segment portion 922W is not part of the truncated code segment, but it is part of the guest code segment. Any guest access to the second code segment portion 922W will result in a general protection fault and an emulation of the instruction that prompted the guest access. Any such guest access must be blocked, however, to protect the VMM memory 930, which completely coincides with the second code segment portion 922W. The truncation of the guest code segment 922G is completely efficient in the sense that all guest accesses that must be blocked to protect the VMM memory are blocked, and no guest accesses are blocked that don't need to be blocked.
In this same sense, the truncation of the guest data segment 920G is not completely efficient, though. The second data segment portion 920W, which is part of the guest data segment, is not part of the truncated data segment 920T, so that guest accesses to the second data segment portion are blocked. This aspect of the truncation is completely efficient because the second data segment portion coincides completely with the VMM memory 930. However, the third data segment portion 920X, which is also part of the guest data segment 920G, is also not part of the truncated data segment 920T, so that guest accesses to the third data segment portion are also blocked. But the third data segment portion does not coincide at all with the VMM memory 930. There is no need to block guest accesses to this portion, but they are blocked nonetheless. Thus, the truncation of the guest data segment 920G gives rise to general protection faults, and to the resulting emulation of guest instructions, for accesses to the third data segment portion 920X, even though such accesses pose no risk to the VMM memory 930.
This inefficiency results from the fact that the guest data segment 920G extends through and beyond the region of the linear address space that is occupied by the VMM memory 930. In this case, the guest data segment wraps around the top of the linear address space 916V, extending up to the top of the address space and continuing through to the bottom portion of the address space. In this embodiment, with the VMM memory occupying the top of the linear address space, any guest memory segment that wraps around the top of the linear address space 916V, such as the guest data segment 920G, will lead to inefficiencies in the sense described above. A memory segment can only wrap around the top of the linear address space if it has a non-zero base. As mentioned above, the OSs that were most important when the earlier VMware products were developed made very little use of the top 4 MB of their linear address spaces. Memory segments with non-zero bases were even less common, so it was very uncommon for a memory segment to wrap around the top of the address space, causing the inefficiency described above. Therefore, again, the segmented memory protection mechanism described above was an efficient, effective method to allow the VMM to share the linear address space of the guest software.
Recent changes to the Linux OS, however, have increased that OS's use of the upper 4 MB of its address space, and have also increased the use of memory segments with non-zero bases that wrap around the top of the address space. As a result, the segmented memory protection mechanism described above is not as efficient for the newer versions of Linux as it is for older versions of Linux.
One recent change to Linux that leads to inefficiencies in the protection mechanism involves the adoption of the Native POSIX (Portable Operating System Interface for Unix) Thread Library (NPTL). The purpose of the NPTL is to improve the performance of threaded applications on the Linux OS. With the NPTL, all of the threads of an application share a single linear address space, but each thread has its own instruction pointer, register set and stack. A separate portion of the address space is set aside for use as a stack for each of the threads of an application. Each thread typically also uses some memory for local storage, which is often used both by the NPTL and by application code.
In other architectures, the NPTL allocates different registers to point to local storage for different threads of an application. In the x86 architecture, however, because of the limited number of general purpose registers available, the NPTL uses memory segmentation to distinguish between the local storage of the multiple threads in an application. Specifically, a different segment descriptor is created for each thread, with each descriptor defining a memory segment with a different base address and a 4 GB limit. The local storage for each thread is located at and around the base address of the respective memory segment. The GS register is loaded with different segment selectors to select the different segment descriptors to allow each thread to access its own local storage, using its own memory segment. Each thread can access its own memory segment, when its segment descriptor is loaded into the GS register, by simply applying a segment override prefix to instructions to cause a reference to the GS register.
The memory segments for local storage for all threads, except possibly one, wrap around the top of the address space, because they have non-zero base addresses and a 4 GB limit. Also, the NPTL specification allows the thread local storage to be accessed using both positive and negative offsets from the base address. If a new version of Linux is used as a guest OS 320, every time a guest access uses a negative offset to access thread local storage in a memory segment that wraps around the top of the linear address space, segment truncation will cause a general protection fault and the instruction will need to be emulated. Most of the time in these situations, the linear address that is being referenced will not be in the same region of the linear address space 916V as the VMM memory 930. The VMM 400 will truncate the memory segments for the thread local storage to protect the VMM memory 930, but the truncation will block many guest accesses that do not put the VMM memory at risk.
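The arithmetic behind this problem may be illustrated with a short Python sketch. It is illustrative only; the function name and the example base address are hypothetical. A negative offset, taken modulo 4 GB, becomes a very large unsigned offset that exceeds the truncated limit, even though the resulting linear address lies just below the segment base and well away from the VMM memory 930. The sketch assumes a segment with a 4 GB limit whose base lies below the VMM memory.

```python
ADDRESS_SPACE = 1 << 32                  # 4 GB linear address space
VMM_BASE = ADDRESS_SPACE - (4 << 20)     # VMM occupies the top 4 MB


def classify_tls_access(base, signed_offset):
    """Return (linear_address, faults, hits_vmm) for one access."""
    # A negative offset wraps to a large unsigned 32-bit offset.
    offset = signed_offset % ADDRESS_SPACE
    linear = (base + offset) % ADDRESS_SPACE
    truncated_limit = VMM_BASE - 1 - base    # limit in the shadow descriptor
    faults = offset > truncated_limit        # limit check on the truncated segment
    hits_vmm = linear >= VMM_BASE            # would the access touch VMM memory?
    return linear, faults, hits_vmm


tls_base = 0x40001000                        # hypothetical thread-local base
negative = classify_tls_access(tls_base, -8)
positive = classify_tls_access(tls_base, 16)
```

In this example the access at an offset of -8 faults although it does not touch the VMM memory, whereas the access at an offset of +16 proceeds without a fault.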
This predicament is generally illustrated in FIG. 5B. FIG. 5B shows the linear address space 916V of the guest software, including the VMM memory 930. FIG. 5B also shows a set of seven exemplary guest data segments, defined by guest segment descriptors, along with a set of seven corresponding data segments that would be created by the protection mechanism described above. A first data segment 940 has a non-zero base address and a limit such that the data segment 940 does not extend into the region of the linear address space 916V that is occupied by the VMM memory 930. Thus, there is no need to truncate the first data segment 940. The limit of the guest segment descriptor for the data segment 940 is copied directly into the corresponding shadow descriptor. A second data segment 942 has a base address of zero and a limit such that, again, the data segment 942 does not extend into the region of the address space that is occupied by the VMM memory. There is no need to truncate the second data segment 942 either, when creating a corresponding shadow descriptor.
A third data segment 944, which is a so-called “flat” segment, has a base address of zero and extends the entire 4 GB of the linear address space 916V. The third data segment 944 comprises a first data segment portion 944A that does not coincide with the VMM memory 930 and a second data segment portion 944B that does coincide with the VMM memory. Under the protection mechanism described above, the third data segment 944 is truncated at the top of the first data segment portion 944A, so that the truncated data segment includes only the first data segment portion 944A, and not the second data segment portion 944B. The first, second and third data segments 940, 942 and 944 do not lead to inefficiencies in the protection mechanism because only guest accesses that need to be blocked are, in fact, blocked.
FIG. 5B also shows four data segments that do lead to inefficiencies in the protection mechanism. A fourth data segment 946, a fifth data segment 947, a sixth data segment 948 and a seventh data segment 949 all have different base addresses and a 4 GB limit. These four data segments, along with the third data segment 944, are representative of the type of data segments created by the NPTL for local storage for different threads of an application. Thus, the third data segment 944 might be for local storage for a first thread of an application, the fourth data segment 946 might be for local storage for a second thread of the application, the fifth data segment 947 might be for local storage for a third thread of the application, the sixth data segment 948 might be for local storage for a fourth thread of the application, and the seventh data segment 949 might be for local storage for a fifth thread of the application.
Each of the four data segments 946, 947, 948 and 949 includes three data segment portions, a first of which occupies the address space between the base address of the respective data segment and the base address of the VMM memory 930, a second of which coincides completely with the VMM memory, and a third of which extends from a linear address of zero back up to the base address of the respective data segment. Thus, the fourth data segment 946 comprises a first data segment portion 946A, a second data segment portion 946B and a third data segment portion 946C; the fifth data segment 947 comprises a first data segment portion 947A, a second data segment portion 947B and a third data segment portion 947C; the sixth data segment 948 comprises a first data segment portion 948A, a second data segment portion 948B and a third data segment portion 948C; and the seventh data segment 949 comprises a first data segment portion 949A, a second data segment portion 949B and a third data segment portion 949C.
Each of the first data segment portions 946A, 947A, 948A and 949A covers the same region of the linear address space 916V as the corresponding truncated data segment covers under the above protection mechanism. Thus, guest accesses in these first data segment portions are not blocked under the above protection mechanism. Each of the second data segment portions 946B, 947B, 948B and 949B covers the region of the address space that is occupied by the VMM memory 930. These second data segment portions are not included in the truncated data segments, so guest accesses to these second data segment portions are blocked under the above protection mechanism. This blocking of guest accesses does not lead to inefficiencies in the protection mechanism, because the guest accesses must be blocked to protect the VMM memory. Each of the third data segment portions 946C, 947C, 948C and 949C covers a region of the linear address space 916V that is not included in the corresponding truncated data segment, but which does not coincide with the VMM memory 930. Any guest access to one of these third data segment portions will be blocked by the above protection mechanism, even though these guest accesses do not pose any risk to the VMM memory. Thus, these third data segment portions are a potential source of inefficiency, in the sense described above, for the above protection mechanism.
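The three kinds of data segment portions can be told apart by the linear address that an access resolves to. The following Python sketch classifies an access into a wrap-around segment with a 4 GB limit. It assumes, for illustration only, that the VMM memory 930 occupies the upper 4 MB of the linear address space; the function name classify and the labels "A", "B" and "C" are hypothetical and simply mirror the suffixes of the reference numerals.

```python
ADDRESS_SPACE = 1 << 32                # 4 GB linear address space
VMM_BASE = ADDRESS_SPACE - (4 << 20)   # assumption: VMM memory is the upper 4 MB


def classify(base, offset):
    """Classify an access at (base + offset) within a 4 GB wrap-around segment.

    "A": inside the truncated segment, so the access is allowed.
    "B": inside the VMM memory, so the access is blocked, as it must be.
    "C": wrapped below the base, so the access is blocked although harmless.
    """
    assert 0 < base < VMM_BASE and 0 <= offset < ADDRESS_SPACE
    linear = (base + offset) % ADDRESS_SPACE  # wrap at the top of the space
    if linear >= VMM_BASE:
        return "B"
    if linear >= base:
        return "A"
    return "C"


base = 0x40000000  # arbitrary example base address
print(classify(base, 0x00001000))  # A
print(classify(base, 0xBFC00000))  # B
print(classify(base, 0xC0000000))  # C
```

Only accesses classified as "C" represent wasted work in this model: they fault and are emulated even though they never touch the VMM memory.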
If the data segments 946, 947, 948 and 949 represent memory segments for local storage for different threads of an application under the NPTL, then any attempted access to one of these memory segments using a negative offset is an attempted access to the corresponding third data segment portion 946C, 947C, 948C or 949C, because a negative offset resolves to a linear address below the base address of the segment. Thus, any such attempted access would be blocked by the above protection mechanism, even though it does not pose a risk to the VMM memory 930. Depending on how particular applications are programmed, such as whether or not they use negative offsets to access local storage for threads, threaded applications that run under the NPTL of the newer Linux OSs may cause substantial inefficiencies in the operation of the above protection mechanism, due to a substantial number of unnecessary general protection faults, each followed by the unnecessary emulation of an instruction.
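The effect of a negative offset can be worked through numerically. The sketch below is a simplified model of a segment limit check, in which the offset is compared against the truncated limit as an unsigned 32-bit value. It assumes that the VMM memory 930 begins at linear address 0xFFC00000 (the upper 4 MB); the function name access and the example base address are hypothetical.

```python
MASK32 = 0xFFFFFFFF
VMM_BASE = 0xFFC00000  # assumption: VMM memory is the upper 4 MB


def access(base, signed_offset):
    """Model one guest access through a truncated thread-local segment.

    Returns (linear address, faults, touches_vmm).
    """
    offset = signed_offset & MASK32        # the offset as a 32-bit unsigned value
    linear = (base + offset) & MASK32      # wraps around the top of the space
    truncated_limit = VMM_BASE - base - 1  # last valid offset after truncation
    faults = offset > truncated_limit      # limit check on the unsigned offset
    touches_vmm = linear >= VMM_BASE
    return linear, faults, touches_vmm


linear, faults, touches_vmm = access(0x40000000, -8)
print(hex(linear), faults, touches_vmm)  # 0x3ffffff8 True False
```

In this example the access lands 8 bytes below the segment base, far from the VMM memory, yet it exceeds the truncated limit and therefore faults.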
A second change that has been made to newer versions of Linux, and which leads to inefficiencies in the operation of the above protection mechanism, involves the introduction of a “vsyscall” form of system calls. Older versions of Linux have implemented system calls using a software interrupt instruction (INT 0x80). Newer processors, however, provide special instructions that yield improved performance for system calls. The x86 architecture, for example, has introduced the instructions SYSENTER and SYSEXIT for this purpose. Linux developers naturally wanted to take advantage of the improved performance of these new instructions, but they also wanted to ensure that newer versions of Linux still work on older processors that do not implement these instructions.
The Linux developers modified the kernel so that the kernel maps a single page in the kernel address space as a user-readable “vsyscall” page. If the kernel determines that it is running on a processor that implements the new system call instructions, the kernel adds a system call routine to the vsyscall page that uses the SYSENTER instruction. If, on the other hand, the kernel determines that the processor on which it is running does not implement the new system call instructions, the kernel adds a system call routine to the vsyscall page that uses the INT 0x80 instruction. Using this technique, user code can make a system call by simply calling a particular location in the vsyscall page. The vsyscall page is set up to take advantage of the new system call instructions if they are supported by the particular processor, or to use the software interrupt if the new instructions are not supported.
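The selection made by the kernel when it populates the vsyscall page can be modeled as follows. This is an illustrative Python sketch and not kernel code: the function names are hypothetical, the page is modeled as a list of instruction mnemonics, and the software interrupt is written as "int 0x80". The sketch assumes that support for SYSENTER and SYSEXIT is detected through the SEP feature flag, bit 11 of EDX for CPUID leaf 1.

```python
CPUID_EDX_SEP = 1 << 11  # CPUID leaf 1, EDX bit 11: SYSENTER/SYSEXIT present


def build_vsyscall_page(cpuid_edx):
    """Return a model of the system call routine placed on the vsyscall page."""
    if cpuid_edx & CPUID_EDX_SEP:
        return ["sysenter"]  # fast path on processors with the new instructions
    return ["int 0x80"]      # software interrupt on older processors


def make_system_call(vsyscall_page):
    """User code calls into the page and never names the instruction itself."""
    return vsyscall_page[0]


print(make_system_call(build_vsyscall_page(CPUID_EDX_SEP)))  # sysenter
print(make_system_call(build_vsyscall_page(0)))              # int 0x80
```

The point of the indirection is that the same user code runs unchanged on both kinds of processor; only the contents of the page differ.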
Unfortunately, the Linux developers decided to place the vsyscall page on the second-to-last page in the linear address space, which is within the region of the linear address space that is occupied by the VMM 400 in the VMware products described above. Thus, the above protection mechanism causes a general protection fault every time the guest software makes a system call. In addition, the CPL change code that is used during a system call for switching from user mode to a more-privileged CPL and for switching from a privileged CPL back to user mode is also placed on the vsyscall page. When switching back to user mode from supervisor mode, a few instructions are executed in the vsyscall page after the CPL has changed to a value of 3. The VMM 400 cannot execute these instructions directly on the system hardware because the protection mechanism would generate faults. Consequently, the VMM 400 is not able to switch back to direct execution as soon as the guest software returns to a CPL of 3. Instead, the VMM 400 might remain in binary translation mode until execution leaves the vsyscall page.
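The address arithmetic behind this conflict is brief. The Python sketch below assumes 4 KB pages and assumes that the VMM 400 occupies the upper 4 MB of the linear address space; the function names are hypothetical. The function execution_mode is a deliberately simplified model of the behavior described above, in which direct execution is used only for guest code at a CPL of 3 that is outside the vsyscall page.

```python
PAGE_SIZE = 4096
ADDRESS_SPACE = 1 << 32
VMM_BASE = ADDRESS_SPACE - (4 << 20)           # assumption: VMM in the upper 4 MB
VSYSCALL_PAGE = ADDRESS_SPACE - 2 * PAGE_SIZE  # second-to-last page


def in_vmm_region(addr):
    return VMM_BASE <= addr < ADDRESS_SPACE


def on_vsyscall_page(addr):
    return VSYSCALL_PAGE <= addr < VSYSCALL_PAGE + PAGE_SIZE


def execution_mode(guest_cpl, eip):
    """Simplified policy: stay in binary translation while on the vsyscall page."""
    if guest_cpl == 3 and not on_vsyscall_page(eip):
        return "direct execution"
    return "binary translation"


print(hex(VSYSCALL_PAGE), in_vmm_region(VSYSCALL_PAGE))  # 0xffffe000 True
print(execution_mode(3, VSYSCALL_PAGE + 0x10))           # binary translation
print(execution_mode(3, 0x08048000))                     # direct execution
```

The second call models the few instructions that run on the vsyscall page after the CPL has returned to 3, which is the window in which the switch to direct execution is delayed.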
In the earlier VMware products in which the BT unit 462 always executes at a CPL of 1, when binary translation is used for guest code that executes at a CPL of 3, the translated code should not be allowed to access guest memory that requires a supervisor privilege level. However, because the translated code is executed at a CPL of 1, it is able to access both user privilege level and supervisor privilege level memory pages (privilege level settings for memory pages are described in greater detail below). In these earlier VMware products, a separate user-level shadow page table is therefore maintained that includes shadow PTEs for memory pages that are accessible with a user privilege level, but that does not include any shadow PTEs that correspond to guest PTEs that require a supervisor privilege level. When this user-level shadow page table is used, guest accesses can reach only user privilege level memory, which is appropriate, because the guest software is supposed to be executing at a CPL of 3. Thus, in these earlier VMware products, when the BT unit 462 switches from executing code that corresponds to supervisor-level guest software to executing code that corresponds to user-level guest software, the normal shadow page table, which includes shadow PTEs for both user privilege level memory and supervisor privilege level memory, must be replaced by the user-level shadow page table, which includes PTEs only for user privilege level memory. The TLB must also be flushed, to ensure that the user-level guest software is not able to access supervisor privilege level memory.
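The relationship between the two shadow page tables, and the reason for the TLB flush, can be shown with a small model. This Python sketch is illustrative only: page tables are modeled as dictionaries keyed by virtual page number, each shadow PTE carries a hypothetical "user" flag, and the class name ShadowPaging and its methods are assumptions rather than names from the VMware products.

```python
def user_level_shadow(shadow_ptes):
    """Keep only the shadow PTEs that are accessible at user privilege level."""
    return {vpn: pte for vpn, pte in shadow_ptes.items() if pte["user"]}


class ShadowPaging:
    def __init__(self, full_shadow):
        self.full = full_shadow
        self.user_only = user_level_shadow(full_shadow)
        self.active = self.full
        self.tlb = {}  # cached translations: vpn -> ppn

    def switch_guest_cpl(self, guest_cpl):
        """Select the table that matches the guest CPL and flush the TLB."""
        self.active = self.user_only if guest_cpl == 3 else self.full
        self.tlb.clear()  # stale supervisor translations must not survive

    def translate(self, vpn):
        if vpn not in self.tlb:
            if vpn not in self.active:
                raise PermissionError(f"no shadow PTE for page {vpn:#x}")
            self.tlb[vpn] = self.active[vpn]["ppn"]
        return self.tlb[vpn]


mmu = ShadowPaging({
    0x10: {"ppn": 0x200, "user": True},
    0x20: {"ppn": 0x300, "user": False},  # supervisor-only page
})
mmu.translate(0x20)      # allowed while emulating supervisor-level guest code
mmu.switch_guest_cpl(3)  # user-level table becomes active, TLB is flushed
```

Without the call to self.tlb.clear(), the translation for page 0x20 that was cached before the switch would still be usable in this model, which is the leak that the flush prevents.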
In view of the recent changes to the Linux OS, if a newer version of the OS is running as the guest OS 320 in a virtual computer system, there will be substantially more guest accesses to the upper 4 MB of the linear address space of the guest software than there would be if the VM 300 were running an older version of Linux. This will lead to an increased number of general protection faults and emulations of instructions when using the above protection mechanism to safeguard the VMM memory 930. In addition, there is likely to be a substantially greater number of guest accesses that cause a general protection fault and an emulation of the guest instruction, even when the guest access does not pose a risk to the VMM memory 930, due to memory segments that wrap around the top of the linear address space. The added faults and resulting emulation of instructions may significantly slow down the operation of the virtual computer system. What is needed therefore is a protection mechanism that allows a VMM to safely and transparently share a linear address space of a guest, but which is more efficient for OSs that make increased use of portions of the upper 4 MB of their linear address space and that use more memory segments that wrap around the top of the linear address space. This invention provides such a mechanism.