1. Field of the Invention
This invention relates to a computer architecture, including a virtual machine monitor, and a related operating method that allow virtualization of the resources of a modern computer system.
2. Description of the Related Art
The operating system plays a special role in today's personal computers and engineering workstations. Indeed, it is the only piece of software that is typically ordered at the same time the hardware itself is purchased. Of course, the customer can later change operating systems, upgrade to a newer version of the operating system, or even re-partition the hard drive to support multiple boots. In all cases, however, a single operating system runs at any given time on the computer. As a result, applications written for different operating systems cannot run concurrently on the system.
Various solutions have been proposed to solve this problem and eliminate this restriction. These include virtual machine monitors, machine simulators, application emulators, operating system emulators, embedded operating systems, legacy virtual machine monitors, and boot managers.
Virtual Machine Monitors
One solution that was the subject of intense research in the late 1960s and 1970s came to be known as the "virtual machine monitor" (VMM). See, for example, R. P. Goldberg, "Survey of virtual machine research," IEEE Computer, Vol. 7, No. 6, 1974. During that time, moreover, IBM Corp. adopted a virtual machine monitor for use in its VM/370 system.
A virtual machine monitor is a thin piece of software that runs directly on top of the hardware and virtualizes all the resources of the machine. Since the exported interface is the same as the hardware interface of the machine, the operating system cannot determine the presence of the VMM. Consequently, when the hardware interface is compatible with the underlying hardware, the same operating system can run either on top of the virtual machine monitor or on top of the raw hardware.
Virtual machine monitors were popular at a time when hardware was scarce and operating systems were primitive. By virtualizing all the resources of the system, multiple independent operating systems could coexist on the same machine. For example, each user could have her own virtual machine running a single-user operating system.
The research in virtual machine monitors also led to the design of processor architectures that were particularly suitable for virtualization. It allowed virtual machine monitors to use a technique known as "direct execution," which simplifies the implementation of the monitor and improves performance. With direct execution, the VMM sets up the processor in a mode with reduced privileges so that the operating system cannot directly execute its privileged instructions. The execution with reduced privileges generates traps, for example when the operating system attempts to issue a privileged instruction. The VMM thus needs only to correctly emulate the traps to allow the correct execution of the operating system in the virtual machine.
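The trap-and-emulate pattern behind direct execution can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the mechanism of any particular VMM: the instruction names, the `PrivilegedTrap` exception, and the `run_guest` helper are all hypothetical and stand in for real hardware traps.

```python
class PrivilegedTrap(Exception):
    """Raised by the toy CPU when de-privileged code issues a privileged instruction."""

    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr


# Hypothetical privileged instructions of the toy machine.
PRIVILEGED = {"cli", "sti"}


def hardware_execute(instr, regs):
    """Toy de-privileged CPU: privileged instructions trap, others run directly."""
    op = instr[0]
    if op in PRIVILEGED:
        raise PrivilegedTrap(instr)
    if op == "add":
        _, reg, value = instr
        regs[reg] = regs.get(reg, 0) + value
    else:
        raise ValueError(f"unknown instruction: {op}")


def run_guest(program):
    """Run a guest program, emulating each trap against virtual CPU state."""
    regs = {}
    virtual_cpu = {"interrupts_enabled": True}
    traps = 0
    for instr in program:
        try:
            hardware_execute(instr, regs)  # direct execution
        except PrivilegedTrap as trap:
            traps += 1
            # Emulate the effect on the virtual processor only;
            # the real interrupt flag is never touched by the guest.
            virtual_cpu["interrupts_enabled"] = trap.instr[0] == "sti"
    return regs, virtual_cpu, traps
```

In this sketch the guest's attempt to disable interrupts changes only the virtual processor state kept by the monitor, which is the essential property that direct execution relies on.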
As hardware became cheaper and operating systems more sophisticated, VMMs based on direct execution began to lose their appeal. Recently, however, they have been proposed to solve specific problems. For example, the Hypervisor system provides fault-tolerance, as is described by T. C. Bressoud and F. B. Schneider, in "Hypervisor-based fault tolerance," ACM Transactions on Computer Systems (TOCS), Vol. 14, No. 1, February 1996; and in U.S. Pat. No. 5,488,716 "Fault tolerant computer system with shadow virtual processor," (Schneider, et al.). As another example, the Disco system runs commodity operating systems on scalable multiprocessors. See "Disco: Running Commodity Operating Systems on Scalable Multiprocessors," E. Bugnion, S. Devine, K. Govil and M. Rosenblum, ACM Transactions on Computer Systems (TOCS), Vol. 15, No. 4, November 1997, pp. 412-447.
Virtual machine monitors can also provide architectural compatibility between different processor architectures by using a technique known as either "binary emulation" or "binary translation." In these systems, the VMM cannot use direct execution since the virtual and underlying architectures do not match; rather, it must emulate the virtual architecture on top of the underlying one. This allows entire virtual machines (operating systems and applications) written for one processor architecture to run on top of another. For example, the IBM DAISY system has recently been proposed to run PowerPC and x86 systems on top of a VLIW architecture. See, for example, K. Ebcioglu and E. R. Altman, "DAISY: Compilation for 100% Architectural Compatibility," Proceedings of the 24th International Symposium on Computer Architecture, 1997.
Machine Simulators/Emulators
Machine simulators, also known as machine emulators, run as application programs on top of an existing operating system. They emulate all the components of a given computer system with enough accuracy to run an operating system and its applications. Machine simulators are often used in research to study the performance of multiprocessors. See, for example, M. Rosenblum, et al., "Using the SimOS machine simulator to study complex computer systems," ACM Transactions on Modeling and Computer Simulation, Vol. 7, No. 1, January 1997. They have also been used to simulate an Intel x86 machine as the "VirtualPC" or "RealPC" products on a PowerPC-based Apple Macintosh system.
Machine simulators share binary emulation techniques with some VMMs such as DAISY. They differentiate themselves from VMMs, however, in that they run on top of a host operating system. This has a number of advantages as they can use the services provided by the operating system. On the other hand, these systems can also be somewhat constrained by the host operating system. For example, an operating system that provides protection never allows application programs to issue privileged instructions or to change its address space directly. These constraints typically lead to significant overheads, especially when running on top of operating systems that are protected from applications.
Application Emulators
Like machine simulators, application emulators also run as an application program in order to provide compatibility across different processor architectures. Unlike machine simulators, however, they emulate application-level software and convert the application's system calls into direct calls into the host operating system. These systems have been used in research for architectural studies, as well as to run legacy binaries written for the 68000 architecture on newer PowerPC-based Macintosh systems. They have also been used to run x86 applications written for Microsoft NT on Alpha workstations running Microsoft NT. In all cases, the expected operating system matches the underlying one, which simplifies the implementation. Other known systems, such as Insignia's SoftWindows, use binary emulation to run Windows applications and a modified version of the Windows operating system on platforms other than PCs. At least two known systems allow Macintosh applications to run on other systems: the Executer runs them on Intel processors running Linux or Next, and MAE runs them on top of the Unix operating system.
Operating System Emulators
Operating system (OS) emulators allow applications written for one given operating system application binary interface (ABI) to run on another operating system. They translate all system calls made by the application for the original operating system into a sequence of system calls to the underlying operating system. ABI emulators are currently used to allow Unix applications to run on Windows NT (the Softway OpenNT emulator) and to run applications written for Microsoft's operating systems on public-domain operating systems (the Linux WINE project).
Unlike virtual machine monitors and machine simulators, which are essentially independent of the operating system, ABI emulators are intimately tied to the operating system that they are emulating. Operating system emulators differ from application emulators in that the applications are already compiled for the instruction set architecture of the target processor. The OS emulator does not need to concern itself with the execution of the applications, but only with the calls that they make to the underlying operating system.
Embedded Operating Systems
Emulating an ABI at the user level is not an option if the goal is to provide additional guarantees to the applications that are not provided by the host operating system. For example, the VenturCom RTX Real-Time subsystem embeds a real-time kernel within the Microsoft NT operating system. This effectively allows real-time processes to co-exist with traditional NT processes within the same system.
This co-existence requires the modification of the lowest levels of the operating system, that is, its Hardware Abstraction Layer (HAL). This allows the RTX system to first handle all I/O interrupts. This solution is tightly coupled with Windows NT, since both environments share the same address space and interrupt entry points.
Legacy Virtual Machine Monitors
Certain processors, most notably those with the Intel architecture, contain special execution modes that are specifically designed to virtualize a given legacy architecture. Such a mode is designed to support the strict virtualization of the legacy architecture, but not of the existing architecture.
A legacy virtual machine monitor consists of the appropriate software support that allows running the legacy operating system using the special mode of the processor. Specifically, Microsoft's DOS virtual machine runs DOS in a virtual machine on top of Microsoft Windows and NT. As another example, the freeware DOSEMU system runs DOS on top of Linux.
Although these systems are commonly referred to as a form of virtual machine monitor, they run either on top of an existing operating system, such as DOSEMU, or as part of an existing operating system such as Microsoft Windows and Microsoft NT. In this respect, they are quite different from the true virtual machine monitors described above, and from the definition of the term "virtual machine monitor" applied to the invention described below.
Boot Managers
Finally, boot managers such as the public-domain LILO and the commercial System Commander facilitate changing operating systems by managing multiple partitions on the hard drive. The user must, however, reboot the computer to change operating systems. Boot managers therefore do not allow applications written for different operating systems to coexist. Rather, they simply allow the user to reboot another operating system without having to reinstall it, that is, without having to remove the previous operating system.
General Shortcomings of the Prior Art
All of the systems described above are designed to allow applications designed for one version or type of operating system to run on systems with a different version or type of operating system. As usual, the designer of such a system must try to meet different requirements, which are often competing, and sometimes apparently mutually exclusive.
Virtual machine monitors (VMMs) have many attractive properties. For example, conventional VMMs outperform machine emulators since they run at system level without the overhead and constraint of an existing operating system. They are, moreover, more general than application and operating system emulators since they can run any application and any operating system written for the virtual machine architecture. Furthermore, they allow modern operating systems to coexist, not just the legacy operating systems that legacy virtual machine monitors allow. Finally, they allow applications written for different operating systems to time-share the processor; in this respect they differ from boot managers, which require a complete "re-boot," that is, system restart, between applications.
As is the typical case in the engineering world, the attractive properties of VMMs come with corresponding drawbacks. A major drawback is the lack of portability of the VMM itself: conventional VMMs are intimately tied to the hardware that they run on, and to the hardware they emulate. Also, the virtualization of all the resources of the system generally leads to diminished performance.
As is mentioned above, certain architectures (so-called "strictly virtualizeable" architectures) allow VMMs to use a technique known as "direct execution" to run the virtual machines. This technique maximizes performance by letting the virtual machine run directly on the hardware in all cases where it is safe to do so. Specifically, it runs the operating system in the virtual machine with reduced privileges so that the effect of any instruction sequence is guaranteed to be contained in the virtual machine. Because of this, the VMM must handle only the traps that result from attempts by the virtual machine to issue privileged instructions.
Unfortunately, many current architectures are not strictly virtualizeable. This may be because either their instructions are non-virtualizeable, or they have segmented architectures that are non-virtualizeable, or both. In particular, the all-but-ubiquitous Intel x86 processor family has both of these problematic properties, that is, both non-virtualizeable instructions and non-reversible segmentation. Consequently, no VMM based exclusively on direct execution can completely virtualize the x86 architecture.
Complete virtualization of even the Intel x86 architecture using binary translation is of course possible, but the loss of performance would be significant. Note that, unlike cross-architectural systems such as DAISY, in which the processor contains specific support for emulation, the Intel x86 was not designed to run a binary translator. Consequently, no conventional x86-based system has been able to successfully virtualize the Intel x86 processor itself.
What is needed is therefore a VMM that is able to function with both the speed of a direct-execution system and the flexibility of a binary-translation system. The VMM should also have an efficient switch between the two execution modes. This invention provides such a system.
The invention provides a system for virtualizing a computer. The invention comprises a hardware processor; a memory; a virtual machine monitor (VMM); and a virtual machine (VM). The VM has at least one virtual processor and is operatively connected to the VMM for running a sequence of VM instructions. The VM instructions include directly executable VM instructions and non-directly executable instructions.
The VMM according to the invention includes: a binary translation sub-system; a direct execution sub-system; and an execution decision module/sub-system that implements a decision function for discriminating between the directly executable and non-directly executable VM instructions, and for selectively directing the VMM to activate the direct execution subsystem for execution by the hardware processor of the directly executable VM instructions and to activate the binary translation subsystem for execution on the hardware processor of the non-directly executable VM instructions.
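The division of labor among the three sub-systems can be sketched in Python as follows. This is only an illustrative outline: the function `run_vm`, its parameters, and the idea of passing the sub-systems as callables are assumptions made for the example, and the granularity at which a real monitor switches between modes is not specified here.

```python
def run_vm(blocks, is_directly_executable, direct_execute, binary_translate):
    """Route each group of VM instructions to one of the two execution sub-systems.

    `is_directly_executable` plays the role of the decision function;
    `direct_execute` and `binary_translate` stand in for the direct
    execution and binary translation sub-systems. All are hypothetical.
    """
    trace = []
    for block in blocks:
        if is_directly_executable(block):
            direct_execute(block)
            trace.append("direct")
        else:
            binary_translate(block)
            trace.append("translated")
    return trace
```

Keeping the decision function separate from the two execution paths means the criteria for switching can be refined without touching either sub-system.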
In a preferred embodiment of the invention, the hardware processor has a plurality of privilege levels, as well as virtualizeable instructions and non-virtualizeable instructions. The non-virtualizeable instructions have predefined semantics that depend on the privilege level, and the semantics of at least two of the privilege levels are mutually different and non-trapping. In this embodiment, the VM has a privileged operation mode and a non-privileged operation mode and the decision sub-system is further provided for directing the VMM to activate the binary translation sub-system when the VM is in the privileged operation mode.
According to another aspect of the invention, the hardware processor has a plurality of hardware segments and at least one hardware segment descriptor table that is stored in the memory and that has, as entries, hardware segment descriptors. The VM has VM descriptor tables that in turn have, as entries, VM segment descriptors. Furthermore, the virtual processor has virtual segments. In this preferred embodiment, the VMM includes VMM descriptor tables, including shadow descriptors, that correspond to predetermined ones of the VM descriptor tables. The VMM also includes a segment tracking sub-system/module that compares the shadow descriptors with their corresponding VM segment descriptors, indicates any lack of correspondence between the shadow descriptor tables and their corresponding VM descriptor tables, and updates the shadow descriptors so that they correspond to their respective corresponding VM segment descriptors.
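The compare, indicate, and update behavior of the segment tracking sub-system might look like the following Python sketch. Descriptors are modeled as plain dictionaries, the two tables are assumed to have the same length, and the `convert` hook (how a VM descriptor maps to its shadow form) is a hypothetical placeholder that defaults to a plain copy.

```python
def sync_shadow_descriptors(vm_table, shadow_table, convert=lambda d: dict(d)):
    """Bring shadow descriptors back into correspondence with the VM's descriptors.

    Returns the indices that were found out of correspondence.
    Assumes len(shadow_table) == len(vm_table).
    """
    stale = []
    for index, vm_desc in enumerate(vm_table):
        expected = convert(vm_desc)
        if shadow_table[index] != expected:
            stale.append(index)             # indicate the lack of correspondence
            shadow_table[index] = expected  # update the shadow descriptor
    return stale
```

A second call made immediately after the first finds nothing to update, since every shadow entry then matches its VM descriptor.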
The VMM in the preferred embodiment of the invention additionally includes one cached entry in the VMM descriptor tables for each segment of the processor, the binary translation sub-system selectively accessing each cached entry instead of the corresponding shadow entry. Furthermore, the hardware processor includes a detection sub-system that detects attempts by the VM to load VMM descriptors other than shadow descriptors, and updates the VMM descriptor table so that the cached entry corresponding to the processor segment also corresponds to the VM segment descriptor. The VMM thereby also uses binary translation using this cached entry until the processor segment is subsequently loaded with a VMM descriptor that is a shadow descriptor.
In another aspect of the invention, the hardware processor has predetermined caching semantics and includes non-reversible state information. The segment tracking sub-system is further provided for detecting attempts by the VM to modify any VM segment descriptor that leads to a non-reversible processor segment. The VMM then also updates the VMM descriptor table so that the cached entry corresponding to the processor segment also corresponds to the VM segment descriptor, before any modification of the VM segment descriptor. The decision sub-system is further provided for directing the VMM to activate the binary translation sub-system when the segment-tracking sub-system has detected creation of a non-reversible segment, and the binary translation sub-system uses the cached entry until the processor segment is subsequently loaded with a VMM descriptor that is a shadow descriptor.
According to yet another aspect of the invention, the hardware processor has a native mode; and the virtual processor in the VM has native and non-native execution modes, in which the non-native execution modes are independent of the VM segment descriptor tables for accessing segments. The decision sub-system is then further provided for directing the VMM to operate using the cached descriptors and to activate the binary translation sub-system when the hardware processor is in the non-native execution mode. The binary translation sub-system thereby uses the cached entry in the native mode when at least one of the following conditions is present: the virtual processor is in one of the non-native execution modes; and at least one virtual processor segment has been most recently loaded in one of the non-native execution modes.
According to still another aspect of the invention, the hardware processor and the virtual processor each has native and non-native execution modes, in which at least one of the non-native execution modes is strictly virtualizeable. The decision sub-system then directs the VMM to run in the same execution mode as the virtual processor.
In implementations of the invention in which the hardware processor has a memory management unit (MMU), the invention further comprises a memory tracing mechanism, included in the VMM, for detecting, via the MMU, accesses to selectable memory portions. The segment tracking sub-system is then operatively connected to the memory tracing mechanism for detecting accesses to selected memory portions.
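A memory tracing mechanism of this kind can be approximated by the Python sketch below. It assumes page-granular tracing with a hypothetical 4096-byte page size, and the `on_access` method stands in for the point at which a real monitor would be notified by the MMU; the class and method names are invented for the illustration.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class MemoryTracer:
    """Toy tracer: maps traced pages to callbacks invoked on access."""

    def __init__(self):
        self._handlers = {}  # page number -> list of callbacks

    def trace(self, start, length, handler):
        """Register `handler` for every page overlapping [start, start + length)."""
        first = start // PAGE_SIZE
        last = (start + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self._handlers.setdefault(page, []).append(handler)

    def on_access(self, address):
        """Notify handlers for the page containing `address`; return True if traced."""
        handlers = self._handlers.get(address // PAGE_SIZE, [])
        for handler in handlers:
            handler(address)
        return bool(handlers)
```

In this model a segment tracking component would register a handler over the memory holding the VM descriptor tables. Because tracing is per page, accesses to unrelated data on the same page are also reported and must be filtered by the handler.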
The invention is particularly well-suited for virtualizing computer systems in which the hardware processor has an Intel x86 architecture that is compatible with at least the Intel 80386 processor. Where the hardware processor has an Intel x86 architecture with at least one non-virtualizeable instruction, and the virtual processor in the VM also has the Intel x86 architecture, the virtual processor has a plurality of processing states at a plurality of current privilege levels (CPL), an input/output privilege level, and means for disabling interrupts. In such a system, the decision sub-system is further provided for directing the VMM to activate the binary translation sub-system whenever at least one of the following conditions occurs: a) the CPL of the virtual processor is set to a most privileged level; b) the input/output privilege level of the virtual processor is greater than zero; and c) interrupts are disabled in the virtual processor. The VMM, by means of the binary translation sub-system, thereby virtualizes all non-virtualizeable instructions of the virtual processor as a predetermined function of the processing state of the virtual processor.
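Conditions a) through c) can be written as a single predicate, as in the Python sketch below. The `VirtualCpuState` type and its field names are hypothetical, and the sketch assumes the x86 convention that privilege level 0 is the most privileged.

```python
from dataclasses import dataclass


@dataclass
class VirtualCpuState:
    cpl: int                  # current privilege level of the virtual processor
    iopl: int                 # input/output privilege level
    interrupts_enabled: bool  # False when the VM has disabled interrupts


MOST_PRIVILEGED_CPL = 0  # assumed x86 convention: ring 0


def needs_binary_translation(state):
    """True when at least one of conditions a), b), c) holds."""
    return (
        state.cpl == MOST_PRIVILEGED_CPL   # a) most privileged level
        or state.iopl > 0                  # b) IOPL greater than zero
        or not state.interrupts_enabled    # c) interrupts disabled
    )
```

For example, a virtual processor at CPL 3 with IOPL 0 and interrupts enabled satisfies none of the conditions, so under this predicate it would not be routed to binary translation on these grounds.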
In the preferred embodiment of the invention, the hardware processor has an Intel x86 architecture with a protected operation mode, a real operation mode, and a system management operation mode. The VMM then operates within the protected operation mode and uses binary translation to execute VM instructions whenever the real and system management operation modes of the processor are to be virtualized. On the other hand, where the hardware processor has an Intel x86 architecture with a strictly virtualizeable virtual 8086 mode, the VMM uses direct execution whenever the virtual 8086 mode of the processor is to be virtualized.
The invention can also be used for virtualizing systems in which the computer has a plurality of hardware processors. In such cases, the invention further comprises a plurality of virtual processors included in the virtual machine; and, in the VMM, VMM descriptor tables for each virtual processor. The segment tracking sub-system then includes means for indicating to the VMM, on selected ones of the plurality of hardware processors, any lack of correspondence between the shadow descriptor tables and their corresponding VM descriptor tables. Additionally, for each hardware processor on which the VMM is running, the decision sub-system discriminates between the directly executable and the non-directly executable VM instructions independent of the remaining hardware processors.