Various forms of computer system virtualization have been used, with varying degrees of success, to improve utilization of the physical resources available in a given computing system platform. In general terms, virtualization enables functionally parallel execution of multiple computer system environments on a given hardware platform. These computer system environments embed guest operating systems and, by the virtualization, can represent, to varying degrees, computing platforms dissimilar from the underlying physical system platform.
Virtualization systems are typically implemented using a virtual machine monitor (VMM), also frequently referred to as a hypervisor, that provides support and coordinated control over one or more co-executed virtual machines (VMs). Each virtual machine represents a discrete execution environment that encapsulates a virtual platform, guest operating system, and address space for the execution of application programs. Over the years, various specific approaches for implementing virtual machine monitors have been proposed and implemented.
Conventional approaches to virtualization that can, at least theoretically, implement a virtual machine monitor include trap-and-emulate, para-virtualization, and binary translation. Trap-and-emulate virtualization relies on a platform central processing unit (CPU) to implement a privilege model that will raise an exception whenever a privilege-dependent instruction is executed in an unprivileged context. Privilege-dependent instructions can be generally classified as those instructions that directly modify a security state of the executing CPU, as those instructions whose execution behavior varies dependent on the privilege level of the execution context, and as those instructions that can be used to reveal the security state of the CPU to enable conditional program execution. In a so-called classically virtualizable computer architecture, all privilege-dependent instructions will raise an exception when executed in an unprivileged context.
A classical trap-and-emulate virtualization system provides for direct execution of a guest operating system within a virtual machine, though at an unprivileged security level. In this system, the virtual machine monitor is executed at a privileged level, and privilege exceptions raised in executing the guest operating system are trapped by the virtual machine monitor. The trapped instruction and related execution context are then evaluated by the virtual machine monitor as needed to enable emulation of the intended guest operating system function that invoked the trapped exception.
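The trap-and-emulate flow described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not an actual virtual machine monitor: the `ToyCPU`, `Monitor`, and `PrivilegeTrap` names are hypothetical, the instruction set is reduced to `ADD`, `CLI`, and `STI`, and the toy CPU is assumed to be classically virtualizable, so that every privileged instruction traps when executed deprivileged.

```python
class PrivilegeTrap(Exception):
    """Raised by the toy CPU when a privileged instruction runs unprivileged."""

    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr


class ToyCPU:
    """Hypothetical CPU on which every privileged instruction traps."""

    PRIVILEGED = {"CLI", "STI"}

    def __init__(self):
        self.acc = 0
        self.interrupt_flag = True
        self.privileged = False  # the guest is always run deprivileged here

    def execute(self, instr):
        op, *args = instr
        if op in self.PRIVILEGED:
            if not self.privileged:
                raise PrivilegeTrap(instr)
            self.interrupt_flag = (op == "STI")
        elif op == "ADD":
            self.acc += args[0]
        else:
            raise ValueError(f"unknown instruction: {op}")


class Monitor:
    """Toy monitor: runs guest code directly and emulates whatever traps."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_if = True  # the guest's virtual interrupt-enable state
        self.traps = 0

    def run(self, guest_code):
        for instr in guest_code:
            try:
                self.cpu.execute(instr)   # direct, unprivileged execution
            except PrivilegeTrap as trap:
                self.traps += 1
                self.emulate(trap.instr)  # monitor handles the trapped instruction

    def emulate(self, instr):
        # Apply the intended effect to virtual state, not to the real CPU.
        self.virtual_if = (instr[0] == "STI")


cpu = ToyCPU()
vmm = Monitor(cpu)
vmm.run([("ADD", 2), ("CLI",), ("ADD", 3), ("STI",), ("CLI",)])
print(cpu.acc, vmm.traps, vmm.virtual_if)  # prints: 5 3 False
```

In this sketch the unprivileged `ADD` instructions run without monitor involvement, while each `CLI` or `STI` costs one trap. The real interrupt flag of the toy CPU is never touched by the guest.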
In greater detail, conventional operating systems are nominally implemented to make use of a supervisor/user privilege system. The operating system kernel and certain essential services execute with supervisory rights, while non-essential operating system and user applications execute with reduced user rights. In a typical x86-based architecture, ring-0, 1, 2, and 3 privilege levels are supported by hardware controls. Operating systems conventionally execute at the ring-0 privilege level, while user applications commonly execute at ring-3. Some specialized user-level applications can be run at ring-1 and, for reasons not relevant here, ring-2 is rarely if ever used. The distinction between ring-0 and the higher, less privileged rings is nominally enforced by hardware architecture security controls by raising privilege exceptions if certain privilege dependent instructions are executed outside of ring-0. Conventionally, a privilege exception is treated as a non-reentrant event, since a user-level program that executes a privileged instruction is typically terminated as a security precaution. Still, x86-based architectures do support the ability to restart execution of an instruction that invokes a privilege trap exception. Generation of a privilege exception results in a context switch to the ring-0 privilege level where the exception is handled by an associated exception handler.
The context switch and subsequent emulation operation of the virtual machine monitor impose a performance overhead in the virtualized execution of guest operating systems. Minimizing this overhead is thus a concern in all virtual machine implementations. Unfortunately, the context switch and emulation overhead is not the only or even principal problem with trap-and-emulate virtualization systems. Rather, the principal problem is that the conventionally prevalent x86 architectural model is not a classically virtualizable architecture. While many privilege-dependent instructions will appropriately generate privilege exceptions, other standard x86 instructions cannot be made to generate privilege exceptions for activities that should be confined to ring-0 execution. For example, various x86 instructions can be used to modify the contents of certain x86 CPU-internal registers that contain control bits modifiable only in a ring-0 execution context. Other bits in these registers may be validly written outside of ring-0 execution. Any x86 instruction that attempts to modify the ring-0 constrained control bits outside of ring-0 execution will fail to generate a privilege exception, and the attempted modification will be silently ignored. Further, where the modification is attempted specifically by a deprivileged guest operating system kernel, the intended kernel behavior will not be realized. Consequently, the execution behavior of these instructions differs based on the privilege level of execution.
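The silent-failure behavior described above can be sketched with a simplified model of a flags-register write. The x86 `POPF` instruction is one well-known instruction of this kind, although the text above does not name it. The function name `toy_popf` and the reduction of the register to two bits are assumptions made purely for illustration; the real register has more fields and more detailed rules.

```python
IF_BIT = 1 << 9  # interrupt-enable flag: writable only with sufficient privilege
CF_BIT = 1 << 0  # carry flag: writable at any privilege level


def toy_popf(flags, new_value, cpl, iopl=0):
    """Toy model of a POPF-like flags write (hypothetical helper).

    Bits the caller is not allowed to change keep their old value.
    No exception is raised, so a monitor never sees the attempt.
    """
    writable = CF_BIT
    if cpl <= iopl:
        writable |= IF_BIT
    return (flags & ~writable) | (new_value & writable)


flags = IF_BIT  # interrupts currently enabled

# A kernel at ring-0 clears IF and sets CF: both changes take effect.
print(toy_popf(flags, CF_BIT, cpl=0) == CF_BIT)             # prints: True

# The same write from ring-3: CF changes, IF silently stays set.
print(toy_popf(flags, CF_BIT, cpl=3) == (IF_BIT | CF_BIT))  # prints: True
```

A deprivileged guest kernel executing the second call would believe interrupts are disabled when they are not, and a trap-and-emulate monitor gets no opportunity to correct this because nothing traps.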
Another problem can arise for guest operating system modules intended to execute in both privileged and non-privileged circumstances. Given that the guest operating system is executed in user, rather than supervisory mode, any run-time differentiating test for privilege-level status implemented by such a module will always identify user-mode execution. The inability to execute privileged operations as intended in the design and implementation of the module will compromise the function of the module and guest operating system as a whole.
Since the conventional x86 architecture does not raise exceptions on execution of all privilege-dependent instructions, the x86 architecture is not classically virtualizable. A further discussion of these problems can be found in the article, Robin, J. S. & Irvine, C. E., “Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor,” Proceedings of the 9th USENIX Security Symposium, Denver, Colo., August 2000.
Para-virtualization takes a different approach to dealing with the existence of privilege-dependent instructions in non-classically virtualizable architectures. As with trap-and-emulate virtualization, para-virtualization systems implement a virtual machine monitor to provide supervisory control over the co-execution of the virtual machines. While the guest operating systems similarly execute deprivileged on the underlying platform, para-virtualization requires the guest operating systems to be directly aware of, and invoke, the virtual machine monitor to handle circumstances involving privilege-dependent instructions. Since conventional operating systems are implemented without provision for interacting with a virtual machine monitor, standard para-virtualization implementations require the guest operating systems to be specifically modified to support virtualization. That is, typically source-code level modification of a guest operating system is required at every point where execution of a privilege-dependent instruction in a deprivileged context could result in an undesirable behavior.
The para-virtualization virtual machine monitor typically contains library routines, accessible from the guest operating systems, that appropriately emulate necessary guest operating system privileged functions. A current, conventional implementation of a para-virtualization virtual machine monitor, known as Xen 3.0, is available from XenSource, Inc., based in Palo Alto, Calif. A drawback to para-virtualization is the requirement to modify the guest operating system core kernel to support virtual machine monitor interactions. Conventionally, each different type and version of each guest operating system supported must be modified. In many instances, access to the required components of the operating system is not available. Because the required modifications are located in the core kernel, a significant testing burden is incurred to ensure that kernel operations are not unintentionally affected, directly or indirectly, in their ability to support consistent execution behavior of higher operating system layers and applications.
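As a rough illustration of the para-virtualization approach, the following Python sketch shows a guest kernel whose source has been modified to call into the monitor rather than execute a privileged instruction. All names here (`ToyHypervisor`, `hypercall`, `set_interrupts`, `ParavirtGuestKernel`) are hypothetical and do not correspond to the interface of Xen or any other actual monitor.

```python
class ToyHypervisor:
    """Hypothetical monitor exposing library routines to modified guests."""

    def __init__(self):
        self.virtual_if = {}  # per-guest virtual interrupt-enable state
        self.hypercalls = 0
        self._routines = {"set_interrupts": self._set_interrupts}

    def hypercall(self, guest_id, name, *args):
        self.hypercalls += 1
        return self._routines[name](guest_id, *args)

    def _set_interrupts(self, guest_id, enabled):
        # Emulate the privileged function against this guest's virtual state.
        self.virtual_if[guest_id] = enabled
        return enabled


class ParavirtGuestKernel:
    """Guest kernel modified at the source level to be monitor-aware."""

    def __init__(self, guest_id, hypervisor):
        self.guest_id = guest_id
        self.hypervisor = hypervisor

    def disable_interrupts(self):
        # An unmodified kernel would execute a privileged instruction here.
        # The modified kernel explicitly invokes the monitor instead.
        self.hypervisor.hypercall(self.guest_id, "set_interrupts", False)


hv = ToyHypervisor()
kernel = ParavirtGuestKernel("guest-1", hv)
kernel.disable_interrupts()
print(hv.virtual_if, hv.hypercalls)  # prints: {'guest-1': False} 1
```

The sketch makes the drawback visible: every such call site in the guest kernel has to be found and rewritten, for each guest operating system type and version.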
Binary translation-based virtualization systems, like trap-and-emulate and para-virtualization systems, typically implement a virtual machine monitor to functionally manage and coordinate execution of guest operating systems within virtual machines. The virtual machine monitor executes in a privileged context and manages the execution of the guest operating systems. As described in, for example, U.S. Pat. No. 6,397,242, issued to Devine et al., and assigned to the assignee of the present application, the virtual machine monitor performs a run-time analysis of the instruction execution stream to identify occurrences of privilege-dependent instructions that, if executed unaltered, could result in undesirable system behavior. The run-time analysis is performed by a binary-to-binary translator that emits a functionally equivalent instruction stream that incorporates emulations of the privilege-dependent instructions. Depending on the nature and use of a privilege-dependent instruction, the binary translation produces some combination of rewritten instructions and call-outs to library routines appropriate to emulate the function of the guest operating system intended to be performed by the privilege-dependent instruction segment. The resulting translated instruction stream is preferably cached, and thereafter executed in replacement of the corresponding portion of the guest operating system.
Although the initial processing and binary translation of an instruction stream imposes a performance burden, subsequent execution of the translated instruction stream from the translation cache achieves near native performance. Given that relatively small portions of modern operating systems are predominantly and repeatedly executed, the overall performance realizable using binary translation-based virtualization is substantial. Binary translation-based virtualization systems thus realize the benefit of supporting non-classically virtualizable architectures without requiring the source-level guest operating system modifications of para-virtualization and without the ongoing performance burden of exception handling overhead every time a privilege-dependent instruction is executed, as incurred under purely trap-and-emulate virtualization.
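The translate-once, execute-many behavior of a translation cache can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited patent: the `ToyTranslator` class, the tuple-based instruction encoding, and the `CALLOUT` marker are all hypothetical, and a real translator operates on machine code rather than symbolic tuples.

```python
PRIVILEGE_DEPENDENT = {"CLI", "STI", "POPF"}


class ToyTranslator:
    """Hypothetical binary translator with a translation cache."""

    def __init__(self):
        self.cache = {}        # guest block address -> translated block
        self.translations = 0  # number of times translation work was done

    def translate(self, block):
        self.translations += 1
        out = []
        for instr in block:
            if instr[0] in PRIVILEGE_DEPENDENT:
                # Replace with a call-out to a monitor emulation routine.
                out.append(("CALLOUT", instr[0]))
            else:
                out.append(instr)  # safe instructions pass through unchanged
        return tuple(out)

    def fetch(self, address, guest_memory):
        # Translate on first use only; later fetches hit the cache.
        if address not in self.cache:
            self.cache[address] = self.translate(guest_memory[address])
        return self.cache[address]


guest_memory = {0x1000: (("ADD", 1), ("CLI",), ("ADD", 2))}
translator = ToyTranslator()

for _ in range(1000):  # a frequently executed guest code path
    block = translator.fetch(0x1000, guest_memory)

print(block)                    # prints: (('ADD', 1), ('CALLOUT', 'CLI'), ('ADD', 2))
print(translator.translations)  # prints: 1
```

The one-time translation cost is spread across all 1000 executions of the block, which is the amortization effect the paragraph above relies on.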
The existence of privilege-dependent instructions in non-classically virtualizable architectures, such as the x86 architecture, has been long recognized. Only recently, however, have a number of hardware-based extensions of the x86 architecture been proposed and, to varying degrees, implemented to support partitioning virtualization. In particular, Intel Corporation has implemented a virtualization technology, or VT, extension that provides hardware-based support for partitioning virtualization in an otherwise non-classically virtualizable architecture. Other vendors, such as Advanced Micro Devices, Inc., have introduced similar extensions in their microprocessor designs. Given the functional similarity, for purposes of discussing the present invention, all of the hardware-based virtualization extensions can be generically referred to as VT extensions.
In summary, VT introduces a privilege overlay system defining two privilege classes. Relative to the conventional x86 privilege model, a new VMX non-root class, functionally containing a standard x86 ring-0, 1, 2, 3 privilege model, has been added. The conventional x86 privilege model is identified as the VMX root class. In use, a virtual machine monitor implementing a VT trap handler will execute in the VMX root ring-0. By executing guest operating systems in the VMX non-root ring-0, many problems with privilege-dependent instructions are resolved; the guest operating systems run in their intended privileged execution mode. Remaining virtualization issues, specifically those arising from the conventionally non-classically virtualizable nature of the x86 architecture, are handled by a controlled deprivilegization of the VMX non-root ring-0 relative to the VMX root ring-0. That is, VT implements VM exit and VM entry operations that encapsulate transitions between the VMX non-root and root privilege states to add exception handling for those privilege-dependent instructions and events that do not conventionally raise privilege exceptions. The execution of these non-classically virtualizable instructions and occurrence of certain operating conditions, particularly related to memory paging, interrupt handling and programmed I/O operations, will, either automatically, or as determined by VT-defined control vectors, force a VM exit transition. This allows a VT trap handler implemented within the virtual machine monitor to handle these specific conditions consistently with respect to the parallel array of virtual machines, and thereby maintain overall operational integrity.
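The distinction between automatic VM exits and those selected through control vectors can be sketched with a small Python model. The `ToyVT` class, its method names, and the string event labels are hypothetical simplifications; they stand in for the hardware-defined control fields and exit reasons and do not reproduce their actual encoding.

```python
class ToyVT:
    """Toy model of VM exit / VM entry transitions (hypothetical names)."""

    UNCONDITIONAL = frozenset({"CPUID"})  # events that always force a VM exit

    def __init__(self, exit_controls):
        self.exit_controls = set(exit_controls)  # exits chosen by the monitor
        self.mode = "non-root"
        self.exit_log = []

    def guest_event(self, event, handler):
        """Return True if the event caused a VM exit handled by the monitor."""
        if event in self.UNCONDITIONAL or event in self.exit_controls:
            self.mode = "root"      # VM exit: control passes to the monitor
            self.exit_log.append(event)
            handler(event)          # the monitor's VT trap handler runs
            self.mode = "non-root"  # VM entry: guest execution resumes
            return True
        return False                # guest handles the event with no transition


handled = []
vt = ToyVT(exit_controls={"IO"})
for event in ["CPUID", "HLT", "IO"]:
    vt.guest_event(event, handled.append)

print(vt.exit_log)  # prints: ['CPUID', 'IO']
print(vt.mode)      # prints: non-root
```

In this model `HLT` runs in the guest without monitor involvement because the monitor did not select it in its controls, whereas `IO` exits because it was selected and `CPUID` exits regardless of the controls.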
Although VT and other, similar, hardware-based virtualization support techniques were developed as a more direct approach to supporting partitioning virtualization, and substantially simplify the implementation of virtual machine monitors, their use has inherent limitations. In particular, the fundamental operation of VT converts many of the privilege-dependent instructions into the equivalent of, if not actual, heavy-weight context switches. That is, while essentially implemented in hardware, the VM exit and VM entry transitions require fairly extensive amounts of state information to be preserved to, and restored from, virtual machine control structures. The significant processing burden of VM exit and VM entry transitions can be particularly problematic where privilege-dependent instructions occur in performance sensitive execution flows within typical guest operating systems. For example, several privilege-dependent instructions are characteristically invoked in the management of page tables. In execution of conventional operating system kernels, page table manipulation is rather frequently performed, given the conventional presumption that performance cost is negligible and optimizing memory access is particularly desirable. A VT-type hardware-based virtualization support system as implemented in conventional virtual machine monitors will typically impose a VM exit and VM entry transition on these page table modifications. The overall result is that, for operating systems that frequently invoke privilege-dependent instructions, VT-type systems will incur virtualization overheads that are not only significant, but noticeable in practical use.
In addition, a substantial processing burden is imposed by the virtual machine monitor being required to evaluate, for purposes of emulation, the intended operation of the privilege-dependent instruction that initiates each VM exit. Although a VM exit transition captures significant state information as part of the hardware implemented VM exit transition, the virtual machine monitor resident VM exit handler must determine anew the intended operation and execution context of the privilege-dependent instruction. Typically, the virtual machine monitor operates to decode the privilege-dependent instruction and further analyze potentially disparate aspects of the execution context of the guest operating system to correctly characterize and implement an emulated execution of a privilege-dependent instruction. Since this decode and analysis is performed after each VM exit transition, the VT analysis and emulation of trapped privilege-dependent instructions is also a substantial source of virtualization overhead.
VT-type hardware-based virtualization does, however, provide significant benefits in certain areas relative to para-virtualization and binary translation virtualization techniques. Relative to para-virtualization, VT virtualization enables execution of unmodified guest operating systems. In comparison to binary translation virtualization, VT virtualization does not impose initial execution overhead, and allows system calls by application programs to the guest operating systems to be executed without intervention by the virtual machine monitor. Also, since VT virtualization does not require a translation cache, a VT virtual machine monitor will require less memory.