1. Field of the Invention
This invention relates to a computer virtualization system and a related method of operation, in particular in the context of one or more virtual machines running on a virtual machine monitor, which in turn is running on underlying hardware with a segmented architecture.
2. Description of the Related Art
The operating system plays a special role in today's personal computers and engineering work stations. Indeed, it is the only piece of software that is typically ordered at the same time the hardware itself is purchased. Of course, the customer can later change operating systems, upgrade to a newer version of the operating system, or even re-partition the hard drive to support multiple boots. In all cases, however, a single operating system runs at any given time on the computer. As a result, applications written for different operating systems cannot run concurrently on the system.
Various solutions have been proposed to solve this problem and eliminate this restriction. These include virtual machine monitors, machine simulators, application emulators, operating system emulators, embedded operating systems, legacy virtual machine monitors, and boot managers.
Virtual machine monitors (VMMs) were the subject of intense research in the late 1960s and 1970s. See, for example, R. P. Goldberg, “Survey of virtual machine research,” IEEE Computer, Vol. 7, No. 6, 1974. During that time, moreover, IBM Corp. adopted a virtual machine monitor for use in its VM/370 system.
A virtual machine monitor is a thin piece of software that runs directly on top of the hardware and virtualizes all, or at least some subset of, the resources of the machine. Since the exported interface is the same as the hardware interface of the machine, the operating system cannot determine the presence of the VMM. Consequently, when the hardware interface is compatible with the underlying hardware, the same operating system can run either on top of the virtual machine monitor or on top of the raw hardware.
Virtual machine monitors were popular at a time when hardware was scarce and operating systems were primitive. By virtualizing all the resources of the system, such prior art VMMs made it possible for multiple independent operating systems to coexist on the same machine. For example, each user could have her own virtual machine running a single-user operating system.
The research in virtual machine monitors also led to the design of processor architectures that were particularly suitable for virtualization. It allowed virtual machine monitors to use a technique known as “direct execution,” which simplifies the implementation of the monitor and improves performance. With direct execution, the VMM sets up the processor in a mode with reduced privileges so that the operating system cannot directly execute its privileged instructions. The execution with reduced privileges generates traps, for example when the operating system attempts to issue a privileged instruction. The VMM thus needs only to correctly emulate the traps to allow the correct execution of the operating system in the virtual machine.
As hardware became cheaper and operating systems more sophisticated, VMMs based on direct execution began to lose their appeal. Recently, however, they have been proposed to solve specific problems. For example, the Hypervisor system provides fault-tolerance, as is described by T. C. Bressoud and F. B. Schneider, in “Hypervisor-based fault tolerance,” ACM Transactions on Computer Systems (TOCS), Vol. 14, No. 1, February 1996; and in U.S. Pat. No. 5,488,716 “Fault tolerant computer system with shadow virtual processor,” (Schneider, et al.). As another example, the Disco system runs commodity operating systems on scalable multiprocessors. See “Disco: Running Commodity Operating Systems on Scalable Multiprocessors,” E. Bugnion, S. Devine, K. Govil and M. Rosenblum, ACM Transactions on Computer Systems (TOCS), Vol. 15, No. 4, November 1997, pp. 412-447.
Virtual machine monitors can also provide architectural compatibility between different processor architectures by using a technique known as either “binary emulation” or “binary translation.” In these systems, the VMM cannot use direct execution since the virtual and underlying architectures mismatch; rather, it must emulate the virtual architecture on top of the underlying one. This allows entire virtual machines (operating systems and applications) written for a particular processor architecture to run on top of another. For example, the IBM DAISY system has recently been proposed to run PowerPC and x86 systems on top of a VLIW architecture. See, for example, K. Ebcioglu and E. R. Altman, “DAISY: Compilation for 100% Architectural Compatibility,” Proceedings of the 24th International Symposium on Computer Architecture, 1997.
All of the systems described above are designed to allow applications designed for one version or type of operating system to run on systems with a different version or type of operating system. As usual, the designer of such a system must try to meet different requirements, which are often competing, and sometimes apparently mutually exclusive.
Virtual machine monitors (VMMs) have many attractive properties. For example, conventional VMMs outperform machine emulators since they run at system level without the overhead and constraint of an existing operating system. They are, moreover, more general than application and operating system emulators since they can run any application and any operating system written for the virtual machine architecture. Furthermore, they allow modern operating systems to coexist, not just the legacy operating systems that legacy virtual machine monitors allow. Finally, they allow applications written for different operating systems to time-share the processor; in this respect they differ from boot managers, which require a complete “re-boot,” that is, system restart, between applications.
As is the typical case in the engineering world, the attractive properties of VMMs come with corresponding drawbacks. A major drawback is the lack of portability of the VMM itself: conventional VMMs are intimately tied to the hardware that they run on, and to the hardware they emulate. Also, the virtualization of all the resources of the system generally leads to diminished performance.
As is mentioned above, certain architectures (so-called “strictly virtualizeable” architectures), allow VMMs to use a technique known as “direct execution” to run the virtual machines. This technique maximizes performance by letting the virtual machine run directly on the hardware in all cases where it is safe to do so. Specifically, it runs the operating system in the virtual machine with reduced privileges so that the effect of any instruction sequence is guaranteed to be contained in the virtual machine. Because of this, the VMM must handle only the traps that result from attempts by the virtual machine to issue privileged instructions.
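By way of illustration only, the trap-and-emulate loop that underlies direct execution can be sketched in Python as follows. The instruction names, the `VirtualCPU` fields and the `PrivilegeTrap` exception are hypothetical stand-ins for a toy machine, not part of any real architecture or of the systems cited above.

```python
from dataclasses import dataclass, field

# Hypothetical privileged opcodes of a toy instruction set (not x86).
PRIVILEGED = {"disable_interrupts", "enable_interrupts"}


class PrivilegeTrap(Exception):
    """Raised when deprivileged guest code issues a privileged instruction."""


@dataclass
class VirtualCPU:
    interrupts_enabled: bool = True
    acc: int = 0
    emulated: list = field(default_factory=list)


def execute_directly(vcpu, instr):
    """Stand-in for the hardware running one guest instruction with reduced privileges."""
    op, *args = instr
    if op in PRIVILEGED:
        raise PrivilegeTrap(op)  # the hardware traps instead of executing
    if op == "add":
        vcpu.acc += args[0]


def emulate_trap(vcpu, instr):
    """The VMM applies the instruction's effect to the virtual processor state only."""
    op = instr[0]
    if op == "disable_interrupts":
        vcpu.interrupts_enabled = False
    elif op == "enable_interrupts":
        vcpu.interrupts_enabled = True
    vcpu.emulated.append(op)


def run(vcpu, program):
    for instr in program:
        try:
            execute_directly(vcpu, instr)
        except PrivilegeTrap:
            emulate_trap(vcpu, instr)


vcpu = VirtualCPU()
run(vcpu, [("add", 2), ("disable_interrupts",), ("add", 3)])
```

In this sketch only the one privileged instruction reaches the VMM; the two `add` instructions run without any VMM involvement, which is the source of the performance benefit described above.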
Unfortunately, many current architectures are not strictly virtualizeable. This may be because either their instructions are non-virtualizeable, or they have segmented architectures that are non-virtualizeable, or both. In particular, the all-but-ubiquitous Intel x86 processor family has both of these problematic properties, that is, both non-virtualizeable instructions and non-reversible segmentation. Consequently, no VMM based exclusively on direct execution can completely virtualize the x86 architecture.
Complete virtualization of even the Intel x86 architecture using binary translation is of course possible, but the loss of performance would be significant. Note that, unlike cross-architectural systems such as DAISY, in which the processor contains specific support for emulation, the Intel x86 was not designed to run a binary translator. Consequently, no conventional x86-based system has been able to successfully virtualize the Intel x86 processor itself.
The parent application (U.S. patent application Ser. No. 09/179,137) discloses a system in which one or more virtual machines (VMs) run on a virtual machine monitor, which in turn is installed on hardware with a segmented architecture, such as the well-known and widely used Intel x86 architecture. In the preferred, albeit not required, configuration, the VMM is installed at system level along with an existing, host operating system. This configuration, which is disclosed in the co-pending U.S. patent application Ser. No. 09/151,175 (“System and Method for Virtualizing Computer Systems”), enables the VMM to allow the host operating system itself to manage certain hardware resources required by a VM and thereby to increase speed.
As is well known, in order to provide the operating system with a flexible mechanism to isolate and protect memory areas serving different purposes, the processors in such architectures include various segment registers. For example, a program's code and data may be placed in two different segments to prevent an erroneous data access from accidentally modifying code. A “descriptor” is a structure in memory that is included in these architectures and that defines the base, limit, and protection attributes of a segment. These values are then stored in special registers of the processor. The segments, and thus the descriptors, generally include not only a visible part, which can be accessed by software once loaded, but also a hidden part, which cannot. Segment registers, in particular, their hidden state, improve performance because they are cached inside of the processor. Without them, each memory access would require the processor to read a segment descriptor from memory to determine the segment's base, limit, and protections. This would be very slow because the descriptor itself is stored in memory.
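As a concrete illustration of the base, limit and protection attributes, the following Python sketch decodes a descriptor, assuming the standard 8-byte protected-mode layout documented in Intel's architecture manuals. The `Segment` class and `decode_descriptor` function are names chosen here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base: int       # 32-bit linear base address
    limit: int      # raw 20-bit limit field
    dpl: int        # descriptor privilege level, 0 to 3
    present: bool   # segment-present (P) bit
    granular: bool  # G flag: limit is counted in 4 KiB units when set


def decode_descriptor(raw: bytes) -> Segment:
    """Decode one 8-byte descriptor given in little-endian memory order."""
    if len(raw) != 8:
        raise ValueError("a segment descriptor is 8 bytes long")
    limit = raw[0] | (raw[1] << 8) | ((raw[6] & 0x0F) << 16)
    base = raw[2] | (raw[3] << 8) | (raw[4] << 16) | (raw[7] << 24)
    access = raw[5]
    return Segment(
        base=base,
        limit=limit,
        dpl=(access >> 5) & 0x3,
        present=bool(access & 0x80),
        granular=bool(raw[6] & 0x80),
    )


# A flat 4 GiB ring-0 code segment, commonly written as 0x00CF9A000000FFFF.
seg = decode_descriptor(bytes.fromhex("ffff0000009acf00"))
```

The fields returned here correspond to what the processor caches in the hidden part of a segment register when the descriptor is loaded.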
In the context of virtualization, one problem that this leads to is that the state of a segment (descriptor) loaded in the appropriate register of the hardware processor may be non-reversible. Here, this means that it cannot be reconstructed once the descriptor in the VM memory has been modified, inasmuch as the architecture does not provide an instruction to save the contents of the hidden state to a descriptor in memory.
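The non-reversibility problem can be shown with a deliberately simplified model. In the Python sketch below, `ToyProcessor` and its methods are hypothetical; descriptors are reduced to `(base, limit)` pairs, and the only point illustrated is that the hidden state is a copy taken at load time.

```python
class ToyProcessor:
    """Toy model: segment registers keep a hidden copy of the loaded descriptor."""

    def __init__(self, table):
        self.table = table   # descriptor table in memory: index -> (base, limit)
        self.hidden = {}     # register name -> cached (base, limit)

    def load_segment(self, reg, index):
        # The descriptor is copied into the hidden state when it is loaded.
        self.hidden[reg] = self.table[index]

    def translate(self, reg, offset):
        # Address translation consults only the hidden state, never memory.
        base, limit = self.hidden[reg]
        if offset > limit:
            raise MemoryError("segment limit violation")
        return base + offset


table = {1: (0x1000, 0xFFF)}
cpu = ToyProcessor(table)
cpu.load_segment("ds", 1)

table[1] = (0x8000, 0xFF)            # the VM overwrites the descriptor in memory
addr = cpu.translate("ds", 0x10)     # the processor still uses the old base
recoverable = table[1] == cpu.hidden["ds"]
```

After the overwrite, memory no longer holds the values in the register, so the register's contents cannot be reconstructed by reloading from the table; this is the state the text calls non-reversible.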
In the prior art, to the extent that the issue was addressed or even recognized at all, non-reversibility meant that the virtualization of the hardware architecture would be either incomplete or impossible. The invention described in the parent application solves this problem by providing for the VMM to allow the VM to run using faster direct execution as long as possible, but to switch to binary translation whenever a VM action leads, among other possibilities, to non-reversibility.
The parent application makes this possible in part by including, in the VMM memory space, different types of “copies” (shadow and cached) of the VM descriptors. Cached descriptors emulate the segment-caching properties of the architecture itself, whereas shadow copies correspond to the VM's stored list of descriptors, but with slight modifications, such as altered privilege levels.
The primary purpose of a cached descriptor is to emulate the hidden segment state of a segment register in an x86 virtual processor, and therefore to solve the irreversibility problem. The concepts of “segment caching” and “hidden state” are thus equivalent. Cached descriptors are required for correct and complete virtualization.
Shadow descriptors, on the other hand, are optional, but they greatly improve performance. In particular, they enable direct execution when none of the virtual processors is in a non-reversible state.
Shadow descriptors allow VM instructions to execute directly, but with reduced privileges, and with a restricted address space, thus protecting the VMM. A shadow descriptor is an almost identical copy of a VM descriptor, with the privilege level and segment limit modified slightly; cached descriptors will, similarly, in general also have truncated limits and reduced privileges.
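One way such a shadow copy might be derived is sketched below in Python. The values `VMM_BASE` and `GUEST_MIN_RING`, the `Descriptor` class and the handling of segments that lie wholly inside the VMM region are all assumptions made for this example; the text above states only that the privilege level and limit are modified slightly. Expand-down segments and address wrap-around are not modeled.

```python
from dataclasses import dataclass, replace

VMM_BASE = 0xFFC00000  # hypothetical: the VMM occupies the top 4 MiB of the address space
GUEST_MIN_RING = 1     # assumption: guest ring 0 is demoted to ring 1


@dataclass(frozen=True)
class Descriptor:
    base: int
    limit: int            # offset of the last valid byte in the segment
    dpl: int              # descriptor privilege level, 0 (most) to 3 (least)
    present: bool = True


def make_shadow(vm_desc: Descriptor) -> Descriptor:
    """Return a copy with reduced privilege and a limit that stops below the VMM."""
    dpl = max(vm_desc.dpl, GUEST_MIN_RING)
    last = min(vm_desc.base + vm_desc.limit, VMM_BASE - 1)
    if last < vm_desc.base:
        # Toy choice: a segment lying wholly in the VMM region becomes unusable.
        return replace(vm_desc, dpl=dpl, limit=0, present=False)
    return replace(vm_desc, dpl=dpl, limit=last - vm_desc.base)


kernel = make_shadow(Descriptor(base=0, limit=0xFFFFFFFF, dpl=0))
user = make_shadow(Descriptor(base=0x1000, limit=0xFFF, dpl=3))
```

In this sketch a flat ring-0 segment is both demoted and truncated, whereas a small ring-3 segment that never reaches the protected region is copied unchanged.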
Whenever the VM changes a VM descriptor, the VMM then synchronously, that is, effectively immediately, updates the corresponding shadow descriptor. In order to track such changes, the invention disclosed in the parent application preferably uses the existing memory-tracing mechanism of the hardware, more specifically, the basic page-protection mechanisms provided by the hardware MMU: The VMM unmaps or protects specific VM pages (for example, those containing descriptor tables, but also other data structures) to detect accesses to those pages, without the VM's knowledge of this happening. The hardware then generates an exception, which is sensed by the VMM, whenever the VM writes to a memory page (the smallest portion of memory for which memory tracing is provided) where, for example, the VM's descriptors are stored. The VMM then updates the shadow copies of the descriptors on that page. This use of page protections for this specific purpose is referred to as “memory tracing.”
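The following Python sketch models memory tracing at page granularity. `TracedMemory` and `SynchronousShadowVmm` are hypothetical names, the callback stands in for the hardware exception, and the sketch assumes 4096-byte pages, 8-byte descriptors and a page-aligned descriptor table.

```python
PAGE_SIZE = 4096
DESC_SIZE = 8


class TracedMemory:
    """Toy guest memory in which a write to a traced page notifies the VMM."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.traced = {}  # page number -> callback standing in for the exception

    def trace(self, page, callback):
        self.traced[page] = callback

    def write(self, addr, value: bytes):
        # The write is completed here on the guest's behalf, then the VMM is notified.
        self.data[addr:addr + len(value)] = value
        first = addr // PAGE_SIZE
        last = (addr + len(value) - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page in self.traced:
                self.traced[page](page)


class SynchronousShadowVmm:
    def __init__(self, mem, table_base, n_pages):
        self.mem = mem
        self.table_base = table_base  # assumed page-aligned
        self.shadow = {}              # descriptor index -> raw shadow copy
        self.sync_count = 0
        for p in range(n_pages):
            mem.trace(table_base // PAGE_SIZE + p, self.on_write_fault)

    def on_write_fault(self, page):
        # Page granularity: every descriptor on the written page is re-copied.
        start = page * PAGE_SIZE
        for off in range(start, start + PAGE_SIZE, DESC_SIZE):
            index = (off - self.table_base) // DESC_SIZE
            self.shadow[index] = bytes(self.mem.data[off:off + DESC_SIZE])
            self.sync_count += 1


mem = TracedMemory(4 * PAGE_SIZE)
vmm = SynchronousShadowVmm(mem, table_base=PAGE_SIZE, n_pages=2)
mem.write(PAGE_SIZE + 3 * DESC_SIZE, b"\x01" * DESC_SIZE)  # guest edits descriptor 3
```

A single 8-byte write in this model causes all 512 descriptors on the page to be re-copied, because the tracing mechanism cannot distinguish which descriptor on the page was touched.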
One shortcoming of the arrangement disclosed in the parent application is that the shadow descriptors are always synchronized and are thus often updated unnecessarily: Assume, for example, that a VM sets a new descriptor table (DT) occupying 16 pages. On a system with the x86 architecture, fully 8192 descriptors would then be synchronized, even though only a few will in all likelihood ever be used. This is due primarily to the level of granularity of the memory-tracing mechanism provided by the hardware. The disclosed invention provides other mechanisms that allow the VMM to avoid having to update entire pages' worth of descriptors due to VM access of a single descriptor, but at the cost of additional VMM actions that slow down processing.
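The figure of 8192 descriptors follows from simple arithmetic, assuming the standard x86 page size of 4096 bytes and descriptor size of 8 bytes (neither value is stated explicitly in the text above):

```python
PAGE_SIZE = 4096       # bytes per page (standard x86 page size, assumed)
DESCRIPTOR_SIZE = 8    # bytes per segment descriptor (assumed)
TABLE_PAGES = 16       # size of the descriptor table in the example above

descriptors_per_page = PAGE_SIZE // DESCRIPTOR_SIZE
total_descriptors = TABLE_PAGES * descriptors_per_page
table_bytes = TABLE_PAGES * PAGE_SIZE
```

Sixteen pages thus hold 16 × 512 = 8192 descriptors in 65,536 bytes, which is also the largest table that a 13-bit descriptor index can address.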
In general, the problem is that there needs to be some way to ensure that the VM descriptors that the VMM shadows actually correspond to the most current VM descriptors that need to be shadowed. At the same time, the efficiency of the shadowing process should be improved, so that shadow descriptors are updated only when there is a need to do so. This invention provides such a mechanism.
The invention provides a method and related system for virtualizing a computer system that has a memory, which has a plurality of memory segments, each corresponding to a range of the memory of the computer. A virtual machine monitor (VMM) is loaded into the computer system; in the case in which the underlying computer system (the “hardware”) is a software simulation or emulation of a physical hardware platform, the VMM is operatively connected to the simulated or emulated hardware. At least one virtual machine (VM), which has at least one virtual processor, is operatively connected to the VMM for running a sequence of VM instructions. The virtual machine (VM) has at least one VM descriptor table that has, as entries, VM segment descriptors, each VM segment descriptor containing memory location identifiers corresponding to a memory segment.
At least one VMM descriptor table is set up in the VMM. This VMM descriptor table includes at least one shadow descriptor table that stores, for predetermined ones of the VM segment descriptors, corresponding shadow descriptors. Each of the predetermined ones of the VM segment descriptors for which a shadow descriptor is stored is then a “shadowed” descriptor. The VMM compares the shadow descriptors with their respective corresponding shadowed VM descriptors; detects a lack of correspondence between the shadow descriptor table and the corresponding VM descriptor table; and updates and thereby synchronizes each shadow descriptor with its respective shadowed VM descriptor no later than upon a first use of the descriptor by the VM, and preferably not until the time the VM first uses the descriptor.
The VMM synchronizes selected ones of the VM segment descriptors. Here “synchronization” means updating the shadow descriptors in the VMM to maintain correspondence with their respective shadowed, synchronized VM descriptors upon every change by the VM to the shadowed, synchronized VM descriptors.
In the preferred embodiment of the invention, the memory segments are hardware memory segments. The computer system includes at least one hardware segment register and at least one hardware descriptor table that has, as entries, hardware segment descriptors. As with other segment descriptors, each hardware segment descriptor contains memory location identifiers corresponding to a memory segment.
The VMM then prevents the VM from accessing and loading unsynchronized shadow descriptors into any hardware segment register. In order to prevent such access and loading, the VMM first detects attempts to do so, using either or both of two mechanisms. One detection mechanism is the tracing of entire memory pages in which VM descriptors are stored; another involves sensing and setting the state of a segment present bit for individual descriptors.
Accordingly, the computer system has a memory management unit (MMU), preferably the preexisting hardware MMU in the computer system. The MMU includes a memory map and is able to trace accesses to designated memory pages and cause a page fault to be generated upon any attempt by software (such as the VM) to access a memory page that is not mapped, that is, not included in the memory map. In the preferred embodiment of the invention, the VMM descriptor table and thus also the shadow descriptors are stored on VMM-associated memory pages. The VMM then selectively maps and unmaps the VMM-associated memory pages. This causes page faults to be generated upon any attempt to load into any hardware segment register a shadow descriptor located on any unmapped VMM-associated memory page.
Upon attempted loading by the VM of a VM segment descriptor into a hardware segment register, the VMM then determines whether a corresponding shadow descriptor exists and is synchronized in the VMM descriptor table. If both these conditions are met, then the corresponding synchronized shadow descriptor is allowed to be loaded. If, on the other hand, the VMM determines that a corresponding synchronized shadow descriptor does not exist, it also determines whether the page containing the VM segment descriptor is mapped. If it is unmapped, then the VMM forwards a corresponding fault to the VM. If the page containing the VM segment descriptor is mapped, however, the VMM also maps the page on which the corresponding shadow descriptor is located; synchronizes the shadow descriptor with the VM segment descriptor that the VM attempted to load; and restarts attempted loading of the VM segment descriptor into the respective hardware segment register. This procedure ensures loading of only synchronized shadow descriptors.
In order to enable whole-page detection of attempted loading into any hardware segment register of any shadow descriptors located on a respective page, the VMM determines if the page of the VMM descriptor table containing the synchronized shadow descriptor contains only shadow descriptors. If it does, then the VMM synchronizes all shadow descriptors on the page whenever any one of them is synchronized.
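The page-mapping procedure of the two preceding paragraphs can be sketched in Python as follows. The class `LazyShadowTable`, the `GuestFault` exception and the placeholder `make_shadow` function are hypothetical, and the sketch assumes that pages of the VM descriptor table and of the shadow table correspond one-to-one and contain only descriptors.

```python
PER_PAGE = 512  # descriptors per page, assuming 4096-byte pages and 8-byte descriptors


class GuestFault(Exception):
    """Stands in for a fault that the VMM forwards to the VM."""


def make_shadow(vm_desc):
    # Placeholder for the real conversion of a VM descriptor into a shadow copy.
    return ("shadow", vm_desc)


class LazyShadowTable:
    def __init__(self, vm_table, vm_mapped_pages):
        self.vm_table = vm_table               # the VM's descriptors, by index
        self.vm_mapped = set(vm_mapped_pages)  # mapped pages of the VM table
        self.shadow = [None] * len(vm_table)
        self.shadow_mapped = set()             # shadow pages currently mapped
        self.faults = 0

    def load(self, index):
        """Model the VM loading descriptor `index` into a segment register."""
        page = index // PER_PAGE
        if page not in self.shadow_mapped:     # the hardware would page-fault here
            self.faults += 1
            self._handle_fault(page)
        return self.shadow[index]              # the restarted load now succeeds

    def _handle_fault(self, page):
        if page not in self.vm_mapped:
            raise GuestFault(page)             # forward the fault to the VM
        start = page * PER_PAGE
        end = min(start + PER_PAGE, len(self.vm_table))
        for i in range(start, end):            # whole-page synchronization
            self.shadow[i] = make_shadow(self.vm_table[i])
        self.shadow_mapped.add(page)           # map the shadow page


vm_table = [f"desc{i}" for i in range(1024)]
table = LazyShadowTable(vm_table, vm_mapped_pages={0})
first = table.load(5)
second = table.load(6)
```

Only the first load from a given page incurs a fault in this model; later loads from the same page proceed without VMM involvement, and pages the VM never uses are never synchronized.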
The VMM preferably also senses changes of mapping of any page(s) of the VM descriptor table. The VMM then desynchronizes and prevents access by the VM to all shadow descriptors that correspond to shadowed descriptors on the page(s) of the VM descriptor table whose mapping is being changed. To ensure reversibility of each such VM segment descriptor, the VMM determines whether any VM segment descriptor on the page(s) of the VM descriptor table whose mapping is being changed is currently loaded in any hardware segment register. For each such VM segment descriptor so loaded, the VMM creates a cached copy of the corresponding shadow descriptor.
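A minimal Python sketch of this remapping step is given below. The function name `on_vm_table_remap` and the data structures passed to it are assumptions made for illustration: `loaded` maps each hardware segment register to the index of the descriptor it currently holds, and `cached` receives the preserved copies.

```python
PER_PAGE = 512  # descriptors per page, assuming 4096-byte pages and 8-byte descriptors


def on_vm_table_remap(pages, shadow, shadow_synced, loaded, cached):
    """Handle a change of mapping of the given pages of the VM descriptor table."""
    # First preserve the state of any affected descriptor that is currently loaded.
    for reg, index in loaded.items():
        if index // PER_PAGE in pages and index in shadow_synced:
            cached[reg] = shadow[index]
    # Then desynchronize every shadow descriptor that belongs to those pages.
    for index in list(shadow_synced):
        if index // PER_PAGE in pages:
            shadow_synced.discard(index)


shadow = {3: "shadow-3", 700: "shadow-700"}
synced = {3, 700}
loaded = {"ds": 3, "es": 700}
cached = {}
on_vm_table_remap({0}, shadow, synced, loaded, cached)
```

In this example only descriptor 3 lies on the remapped page 0, so only the register holding it receives a cached copy, and descriptor 700 on page 1 remains synchronized.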
In order to make possible sensing of accesses or changes to individual descriptors, regardless of other data located on the same memory page, the VMM sets a protection attribute of shadow descriptors to a not present state. Upon attempted loading by the VM of a VM segment descriptor, the VMM then determines whether the protection attribute of the corresponding shadow descriptor is in a present state or in the not present state. If the protection attribute is in the present state, then the VMM loads the corresponding shadow descriptor. If, however, the VMM determines that the protection attribute is in the not present state, the VMM determines whether the page on which the VM segment descriptor is located is mapped. If it is unmapped, then the VMM forwards a corresponding fault to the VM. If, however, the page on which the VM segment descriptor is located is mapped, the VMM synchronizes the shadow descriptor with the VM segment descriptor that the VM attempted to load; sets the protection attribute of the shadow descriptor to the present state; and restarts attempted loading of the VM segment descriptor into the respective hardware segment register.
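The per-descriptor mechanism can be sketched in Python as follows. The names `ShadowEntry`, `PresentBitShadowTable` and `GuestFault` are hypothetical, the shadow copy is simplified to the VM descriptor itself, and the case in which the VM's own descriptor is marked not present is not modeled.

```python
from dataclasses import dataclass

PER_PAGE = 512  # descriptors per page, assuming 4096-byte pages and 8-byte descriptors


class GuestFault(Exception):
    """Stands in for a fault that the VMM forwards to the VM."""


@dataclass
class ShadowEntry:
    value: object = None
    present: bool = False  # every shadow descriptor starts in the not present state


class PresentBitShadowTable:
    def __init__(self, vm_table, vm_mapped_pages):
        self.vm_table = vm_table
        self.vm_mapped = set(vm_mapped_pages)
        self.shadow = [ShadowEntry() for _ in vm_table]
        self.synced = 0

    def load(self, index):
        """Model the VM loading descriptor `index` into a segment register."""
        entry = self.shadow[index]
        if not entry.present:                  # the hardware would fault here
            self._on_not_present(index, entry)
        return entry.value                     # the restarted load now succeeds

    def _on_not_present(self, index, entry):
        if index // PER_PAGE not in self.vm_mapped:
            raise GuestFault(index)            # forward the fault to the VM
        entry.value = self.vm_table[index]     # synchronize this one descriptor
        entry.present = True
        self.synced += 1


vm_table = [f"desc{i}" for i in range(1024)]
table = PresentBitShadowTable(vm_table, vm_mapped_pages={0})
first = table.load(7)
again = table.load(7)
```

In contrast to whole-page tracing, exactly one descriptor is synchronized per first use in this model, at the cost of one fault for each descriptor rather than one for each page.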