1. Field of the Invention
This invention relates in general to multiprocessor computer systems and in particular to systems that include one or more virtual machines.
2. Description of the Related Art
Researchers and engineers in the field of computer science have developed, and continue to refine, principles and technologies for the construction of efficient and economical general-purpose computer systems. These known systems fall generally into two main categories, namely, those that involve virtualization technology and those that do not. In order to understand the former, it is helpful to understand the latter.
Most computer systems can be conveniently described in terms of layers, in particular, a lower, a middle, and an upper layer. See, for example, Andrew S. Tanenbaum's “Structured Computer Organization,” second edition, Prentice/Hall International, 1984.
The lowest layer is the hardware, which includes a processor (CPU), volatile and non-volatile memory, a memory management unit (MMU), disk drives and other input and output devices. A wide range of hardware configurations that offer different trade-offs between performance, compatibility, cost, reliability, power consumption and other parameters have been deployed over the years. For example, the Personal Computer (PC) platform, originally introduced by IBM in 1981, uses an Intel-compatible (“x86”) CPU. This platform has been widely adopted and has resulted in the development of several subsequent industry standards. Today PCs with compatible hardware layers can be obtained from multiple vendors.
The middle layer, the operating system (OS), is a software layer. Operating systems are among the most complex and largest units of software built and may themselves be internally layered or broken into separate modules using some other organizing principle. Broadly speaking, however, an OS manages the raw hardware layer in the computer system, abstracts it, and augments it with services that are commonly used by software in the higher layers of abstraction (that is, application-level programs). For example, the OS may manage raw hard disks, and perform allocation and scheduling functions to give application programs access to a hierarchical file system. The OS may also offer services or software libraries that are not directly related to any particular hardware device in the computer system itself; for example, it may provide the means for application programs to communicate with each other through shared memory or over a wide-area network. By hiding arbitrary dissimilarities of the hardware layer, and augmenting this layer with additional functionality, operating systems can be seen as infrastructure that enables the construction of higher-level software that works predictably across a family of (somewhat) different hardware configurations.
Over the years, many different operating systems have been developed, reflecting the wide range of uses of computer systems and the diversity of hardware. Some operating systems are restricted to one hardware platform; for example, the Windows98 operating system runs on the PC platform only. Others run on a plurality of platforms; the Solaris operating system, for example, runs on both x86 and SPARC hardware.
The highest layer, the set of application programs, is what most users of computer systems ultimately interact with and care about. This application layer builds upon the general-purpose hardware and operating system layers in order to solve concrete computational and information processing problems. Whereas each unit of hardware ordinarily hosts one OS, in most computer systems, the OS will host a number of application programs, possibly from many different vendors.
To summarize the above discussion, most computer systems can be viewed as consisting of one unit of hardware, one OS that manages the hardware, and a set of application programs running on top of the OS. The choice of hardware will restrict the range of possible OSs and the subsequent choice of OS will, together with the hardware choice, determine the available set of application programs.
The second broad category of general-purpose computer systems includes virtualization technology, in particular, at least one software construct known as a “virtual machine” (VM). As in non-virtualized systems, however, even these build upon hardware and system software layers. FIG. 1 shows the main components of a typical virtualized computer system, which includes an underlying system hardware platform 100 and system software 200.
The system hardware 100 includes one or more central processors CPU(s) 110, which may be a single processor, or two or more cooperating processors in a known multiprocessor arrangement. As in most computers, two different types of data storage are commonly provided: a system memory 112, typically implemented using any of the various RAM technologies, and a usually higher-capacity storage device 114 such as one or more memory disks. The hardware usually also includes, or is connected to, conventional registers, interrupt-handling circuitry, etc., as well as a memory management unit MMU 116. The system software 200 typically includes an operating system OS 220, which will include a conventional fault and interrupt handler 270 as well as drivers 222 as needed for controlling and communicating with the various devices 400 and, usually, for the disk 114 itself.
FIG. 1 also shows that conventional peripheral devices 400 may be connected to run on the hardware 100 via the system software 200. Conventional applications 600 may also be installed to run on the system software 200.
As is well known in the art, a virtual machine (VM) is a software abstraction—a “virtualization”—of an actual physical computer system. As such, each VM will typically include one or more virtual CPUs 310 (VCPU), a virtual operating system 320 (VOS) (which may, but need not, simply be a copy of a conventional, commodity OS), virtual system memory 312 (VMEM), a virtual disk 314 (VDISK), virtual peripheral devices 340 (VDEVICES) and drivers 322 (VDRV) for handling the virtual devices 340, all of which are implemented in software to emulate the corresponding components of an actual computer. As in any other operating system, the VOS will also include a fault or interrupt handler 370, which takes appropriate, predefined actions whenever any virtual CPU (or application 360) performs some action that causes the generation of a fault or interrupt signal.
Of course, most computers are intended to run various applications, and VMs are usually no exception. Consequently, by way of example, FIG. 1 illustrates a group of applications 360 (which may be a single application) installed to run on the VOS 320; any number of applications, including none at all, may be loaded for running on the VOS, limited only by the requirements and purposes of the VM. If the VM is properly designed, then the applications (or the user of the applications) will not “know” that they are not running directly on “real” hardware. Of course, all of the applications and the components of the VM are instructions and data stored in memory, just as any other software. The concept, design and operation of virtual machines are well known in the field of computer science. As FIG. 1 illustrates, several VMs 300-1, . . . , 300-n may be installed to run on a common hardware platform; all may have essentially the same general structure, although they may differ in particulars, including possibly having different operating systems.
Some interface is usually required between a VM and the underlying “real” OS 220 and hardware, which are responsible for actually executing VM-issued instructions and transferring data to and from the actual, physical memory and storage devices 112, 114. In this context, “real” means being either the native OS of the underlying physical computer or other system-level software that handles actual I/O operations, takes faults and interrupts, etc. The interface between the VM and the underlying system software layer and/or hardware is often referred to as a virtual machine monitor (VMM).
A VMM is usually a thin layer of software that runs directly on top of a host, such as the system software 200, or directly on the hardware, and virtualizes all the resources of the machine. The VMM usually tracks and either forwards (to the OS 220) or itself schedules and handles all requests by its VM for machine resources and will typically include software components such as device emulators 540, a memory management unit 512, etc. The interface exported to the respective VM is the same as the hardware interface of the machine, or at least of some predefined hardware platform, so that the virtual OS cannot determine the presence of the VMM, although the VMM will be aware of the VOS. The general features of VMMs are known in the art and are therefore not discussed in detail here.
The VMM also includes a sub-system 570 for taking and either handling or forwarding faults and interrupts. Note that “handling” a fault or interrupt involves executing some predetermined routine, which will depend on the type of fault/interrupt involved.
In FIG. 1, VMMs 500-1, . . . , 500-n, are shown, acting as interfaces for their respective attached VMs 300-1, . . . , 300-n. In the figures, VMs are shown as software entities separate from their respective VMMs. This separation reflects the fact that a VM is, from the viewpoint of a user, a “complete” computer system in its own right, with the VMM remaining transparent to the VM. Considering that both the VM and the VMM are software entities running on the system hardware 100, with or without help from the host system software 200, each VM and its related VMM may, however, also be viewed substantially as a unit: The VM cannot function properly without the VMM or a similar software system, and the VMM has no purpose other than to support the VM. Moreover, it would also be possible to use a single VMM to act as the interface to more than one VM, with the VMM exporting multiple instances of the machine interface. The important point is simply that some well-defined, known interface should be provided between each VM and the underlying system hardware 100 and software 200.
Assume a given hardware platform 100. For instance, this platform could be the Personal Computer platform. The VMM is thus a software program that exports to its respective VM an abstraction of a hardware platform, which may, but need not be, the same as the platform 100. Each feature of the hardware platform will typically (but not necessarily) have a software (virtual) implementation in the VMM. For example, in most virtualized computer systems, the VMM will export a virtual CPU (VCPU) that executes the same instruction set as the hardware CPU, and the VMM will export virtual disk drives and random-access memory whose properties are equivalent to the disk drives and memory implemented in the hardware platform.
In some configurations, each VMM runs directly on the hardware platform and is the only software to do so. In this situation, it is natural to think of the VMM as an additional software layer inserted between the hardware and operating system layers of conventional computer systems.
In other configurations, such as the one illustrated in FIG. 1, the VMM runs side by side with the OS 220, which forms the so-called “host” OS. In this situation, it is still possible to view the VMM as an additional software layer inserted between the hardware and the “guest,” that is, virtual OS 320, although the layering differs between the left and right half of the figure. In the left half of FIG. 1, where the host operating system 220 is situated, there is a conventional computer system with three layers: system hardware 100, system software 200 and applications 600. In the right half of the figure, which includes the VMM, we have a four-layer computer system: system hardware, system software, VMM, VM (including user-level applications 360). The software stacks shown in the left and the right halves of the figure operate largely as independent co-routines with the main interaction being that the host OS 220 may be called upon to perform services for the VMM.
It may in some cases be beneficial to deploy VMMs on top of a thin software layer, a “kernel,” constructed specifically for this purpose. In contrast to a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple virtual machines (for example, resource management). In contrast to the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs.
One may also concurrently run multiple VMM instances (as illustrated in FIG. 1), each of which exports one instance of the machine interface. From these scenarios, one of the principal advantages of virtualization in general and of using VMMs in particular follows directly: One may execute multiple operating systems on a single hardware platform. The collection of operating systems deployed on VMMs may be diverse, possibly ranging over operating systems from multiple vendors, or it may be uniform, comprising multiple instances of the same operating system image.
Virtual machine monitors have a long history, dating back to mainframe computer systems in the 1960s. See, for example, Robert P. Goldberg, “Survey of Virtual Machine Research,” IEEE Computer, June 1974, pp. 34–45. Over the years, their popularity has risen and fallen with the changes in the prevailing hardware and software environments. Initially, VMMs were viewed as a way to increase the utilization of expensive hardware resources by multiplexing software environments. Later, hosted VMMs saw increasing use because they grant users of powerful personal computers simultaneous access to multiple operating systems and their application program sets. Use of VMMs to increase utilization of hardware resources has, however, once again been increasing in popularity: VMMs allow a single hardware computer system to co-host multiple independent servers (services), avoiding interference by deploying each server on its own (guest) operating system and managing resources in a kernel layer to allow each server to achieve specified service rates.
Most personal computer systems are equipped with a single processing unit (CPU). Because CPUs today are quite fast, a single CPU often provides enough computational power to handle several “concurrent” tasks by rapidly switching from task to task (a process sometimes known as time-slicing or multiprogramming). This management of concurrent tasks is one of the main responsibilities of almost all operating systems.
The use of multiple concurrent tasks often allows an overall increase in the utilization of the hardware resources. The reason for this is that while one task is waiting for input or output to happen, the CPU may execute other “ready” tasks. See, for example, Abraham Silberschatz and James L. Peterson, “Operating System Concepts,” Alternate Edition, Chapter 4, Addison-Wesley Publishing Company, 1989. As the number of tasks increases, however, the point may be reached where computational cycles, that is, CPU power, become the limiting factor. The exact point where this happens depends on the particular workloads.
Consequently, to permit computer systems to scale to larger numbers of concurrent tasks, systems with multiple CPUs have been developed. Such shared (or “symmetric”) memory multi-processor (SMP) systems are available as extensions of the PC platform, as well as from other vendors. Essentially, an SMP system is a hardware platform that connects multiple processors to a shared main memory and shared I/O devices. In addition, each processor may have private memory. The operating system, which is aware of the multiple processors, allows truly concurrent execution of multiple tasks, using time-slicing only when the number of ready tasks exceeds the number of CPUs.
In some SMP systems, a single operating system image manages the entire set of CPUs in concert. In other systems, especially those with larger numbers of CPUs, the hardware layer may provide a (physical) partitioning of the system, thereby allowing distinct operating system instances to manage each partition. Partitioning helps overcome scalability bottlenecks in operating systems and increases fault isolation.
Because VMMs and SMP systems both aim to increase the utilization of hardware resources, it is quite attractive to look into ways to enable the convergence of the two technologies. One way to achieve this is to run multiple uniprocessor virtual machines on an SMP system. This simple approach is quite beneficial: The increased availability of computational resources on an SMP hardware layer will usually allow the system to handle a larger number of VMMs than can a uniprocessor system. This arrangement can be viewed as software (or logical) partitioning of an entire SMP system or an SMP partition. Compared with a pure hardware partitioning scheme, use of VMMs offers significant advantages. For instance, the VMMs can partition at a finer grain than is convenient for hardware. Indeed, VMMs can partition down to fractional CPU-equivalents by running more VMMs than there are CPUs. It is relatively straightforward to run multiple uniprocessor VMMs on an SMP hardware system. For example, if a kernel approach is used, only the kernel need be aware of the multiple processors on the underlying SMP system.
Unfortunately, if one is restricted to uniprocessor VMMs, then some types of applications that have been written to take advantage of multiple processors will be unable to run well in virtual machines. This leads to the second way to combine the benefits of multiprocessor technology and virtual machine monitors: One may develop a VMM that “exports” to the guest operating system an SMP abstraction. This VMM will then be able to harness the computational power of multiple CPUs and channel them to the guest operating system where they can be put into service. More precisely, a multiprocessor virtual machine monitor generalizes a uniprocessor virtual machine monitor in that it presents the appearance to guest operating systems of running on (virtual) hardware with multiple CPUs.
In the most general case, there is no a priori restriction on the number p of CPUs on the hardware platform and the number v of virtual CPUs that the VMM exports to the guest operating system 320. One may have p<v, which would necessitate use of time-slicing of the physical CPUs to keep all the virtual CPUs running, or one may have p=v, making it possible to run exactly one VMM at a time on the hardware (but time-slicing these), or one may have p>v, making it possible to run (at least) one VMM with v virtual CPUs and still have physical CPUs available to run other VMMs or services concurrently.
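Merely by way of illustration, the relationship between p and v may be sketched in Python as follows. The function name schedule_slot and the round-robin rotation are assumptions made for this example only; no particular scheduling policy is prescribed above. The sketch returns which virtual CPUs of a single VMM would occupy physical CPUs during a given time slot.

```python
def schedule_slot(p: int, v: int, t: int) -> list[int]:
    """Return the virtual CPU indices that run during time slot t.

    p is the number of physical CPUs and v the number of virtual CPUs.
    Round-robin rotation is a hypothetical policy chosen for illustration.
    """
    if p <= 0 or v <= 0:
        raise ValueError("p and v must be positive")
    if p >= v:
        # Enough physical CPUs: every virtual CPU runs in every slot,
        # and p - v physical CPUs remain free for other VMMs or services.
        return list(range(v))
    # p < v: time-slice by rotating a window of p virtual CPUs.
    return [(t * p + i) % v for i in range(p)]
```

With p=2 and v=3, successive slots run virtual CPUs [0, 1], then [2, 0], then [1, 2], so each virtual CPU receives an equal share over three slots.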
Different implementation techniques are known for uniprocessor VMMs, which focus on two areas of relevance: execution of the virtual machine instruction stream and virtualization of memory and memory-mapped structures.
The traditional way to implement a VMM, dating back to the original mainframe VMMs, involves running the guest operating system code at a less privileged level in the virtual machine than it would have had on a physical machine. When running with lesser privileges, any attempt by the guest to execute a privileged instruction will generate a “trap” (also known as an “exception”). When the VMM senses the trap, it takes control. Effectively, the VMM intercepts any attempt by the (under-privileged) guest operating system to execute a privileged instruction on the physical hardware. A trap handler in the VMM then emulates the effect of the privileged instruction, but changes the state of the virtual machine rather than of the physical machine. For example, if the guest operating system attempts to disable interrupts, the VMM emulation of this operation will record in the virtual machine state that interrupts have been disabled, but will leave interrupts enabled on the physical machine. Once emulation of the privileged instruction has completed, the VMM resumes the virtual machine at its next instruction. To the virtual machine, there is no way to determine that the VMM stepped in to emulate the privileged instruction, except by observing timing effects.
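Merely by way of illustration, the trap-and-emulate cycle described above may be sketched in Python as follows. All names (VirtualMachineState, PrivilegedInstructionTrap, the opcode strings "cli" and "sti" standing in for interrupt-disable and interrupt-enable instructions) are hypothetical and chosen for this example; the sketch models the control flow only and is not a description of any actual VMM.

```python
class PrivilegedInstructionTrap(Exception):
    """Models the trap raised when under-privileged code issues a privileged instruction."""

    def __init__(self, opcode):
        super().__init__(opcode)
        self.opcode = opcode


class VirtualMachineState:
    def __init__(self):
        self.interrupts_enabled = True  # virtual interrupt flag
        self.pc = 0                     # index of the next guest instruction


# Assumed set of privileged opcodes for this sketch.
PRIVILEGED = {"cli", "sti"}

# Physical machine state; the emulation below never modifies it.
physical = {"interrupts_enabled": True}


def direct_execute(instr):
    """Stand-in for direct execution: privileged instructions trap."""
    if instr in PRIVILEGED:
        raise PrivilegedInstructionTrap(instr)
    # Non-privileged instructions would run natively; nothing to model here.


def emulate(vm, opcode):
    """VMM trap handler: apply the effect to the virtual machine state only."""
    if opcode == "cli":
        vm.interrupts_enabled = False
    elif opcode == "sti":
        vm.interrupts_enabled = True


def run(vm, program):
    """Run the guest program and return the number of traps taken."""
    traps = 0
    while vm.pc < len(program):
        try:
            direct_execute(program[vm.pc])
        except PrivilegedInstructionTrap as trap:
            emulate(vm, trap.opcode)
            traps += 1
        vm.pc += 1  # resume the guest at its next instruction
    return traps
```

Running the program ["add", "cli", "mov"] takes one trap and leaves the virtual interrupt flag cleared while the physical flag remains set.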
Provided that all privileged instructions trap when execution is attempted with insufficient privileges—a condition that is satisfied on so-called “virtualizable” architectures—the VMM can remain passive during guest system execution, except for the brief intervention when a privileged instruction must be emulated. More precisely, most of the time the VMM can use “direct execution” on the physical hardware to execute the virtual instruction stream. Thus, the virtual machine executes at native speed, except for the slowdown resulting from time-sharing the physical machine with other software and the overhead of occasional emulation of a privileged instruction. As a special case, non-privileged (application level) software in the virtual machine will execute at full speed, since it is generally free of privileged instructions.
However, many contemporary processors, including the x86 family, are not virtualizable using this trap-and-emulate technique alone. VMMs for such architectures therefore employ other techniques in addition to direct execution to execute the virtual instruction stream. For example, the VMM for the x86 architecture produced by VMware, Inc., of Palo Alto, Calif., employs direct execution only for the non-privileged code in the virtual machine and transforms privileged code through a binary translation process before it is allowed to execute.
In addition to controlling the instruction stream executed by software in virtual machines, the VMM must also control other resources in order to ensure that the virtual machines remain encapsulated and do not interfere with other software on the system. First and foremost, this applies to I/O devices that are shared between virtual machines, but it also applies to interrupt vectors, which generally must be directed into the VMM (the VMM will conditionally forward interrupts to the virtual machine). Furthermore, the memory management (MMU) functionality must be under control of the VMM in order to prevent the virtual machine from accessing memory belonging to other software on the computer. Yet other resources, some of which may be specific to particular architectures, including the local and global descriptor tables of the x86 architecture, may need to be monitored or adjusted by the VMM.
In one solution employed in the virtualization products of VMware, Inc., the guest (VM) operating system sets up a global descriptor table (GDT) somewhere in its memory. The GDT defines the segments that the guest operating system uses in its execution of operating system and user level code. (For a general description of segments in the x86 architecture, see “Intel Architecture Software Developer's Manual,” vol. 3: “System Programming,” Intel Corporation, 1999.) In this configuration, the guest's GDT is referred to as the “primary” GDT. The VMM then derives a “shadow” GDT from the primary GDT. If the guest operating system executes directly on the hardware, it will load a reference to its primary GDT into the physical processor's GDT register. This, however, could be dangerous when running in a virtual machine, which cannot be allowed to have full control over its GDT. Instead of activating the primary GDT, the VMM therefore loads the shadow GDT when running the guest operating system. Since the VMM controls the shadow GDT, it can confine the guest operating system within the virtual machine boundaries.
The structure of the shadow GDT in the VMware, Inc., system generally follows that of the primary GDT, but with permissions downgraded. For example, a data segment descriptor in the primary GDT will yield a derived data segment descriptor in the shadow GDT. Whereas the primary GDT descriptor can take any form that the guest operating system desires, the shadow GDT descriptor will be restricted by the VMM. In particular, the base and the limit of the primary descriptor may permit access to the entire address space (for example, 0 to 0xffffffff), but the VMM may truncate the limit in the shadow descriptor to confine the guest to a smaller range of addresses (for example, 0 to 0xffbfffff). This truncation allows the VMM to remain invisible to guest operating systems since it can reside in the address range inaccessible to guests, in this example, above 0xffbfffff.
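Merely by way of illustration, the derivation of a shadow descriptor by limit truncation may be sketched in Python as follows. The SegmentDescriptor type and the derive_shadow function are hypothetical, and the descriptor is greatly simplified: actual x86 descriptors also encode granularity, type and privilege bits, which are not modeled here. The boundary 0xffbfffff is taken from the example above.

```python
from dataclasses import dataclass, replace

# Highest address the guest may reach; the VMM is assumed to reside above it.
GUEST_MAX_ADDRESS = 0xFFBFFFFF


@dataclass(frozen=True)
class SegmentDescriptor:
    base: int   # linear address at which the segment starts
    limit: int  # highest valid offset within the segment


def derive_shadow(primary: SegmentDescriptor) -> SegmentDescriptor:
    """Derive a shadow descriptor whose reach never exceeds GUEST_MAX_ADDRESS."""
    if primary.base > GUEST_MAX_ADDRESS:
        # A segment lying wholly in the VMM range is outside this sketch.
        raise ValueError("segment base lies in the range reserved for the VMM")
    max_limit = GUEST_MAX_ADDRESS - primary.base
    return replace(primary, limit=min(primary.limit, max_limit))
```

A primary descriptor that already fits below the boundary is copied unchanged, so only segments that would otherwise reach into the reserved range are affected.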
For correctness, the VMM must propagate modifications from the primary structures to the shadow tables as soon as the guest modifies the primary structures. A convenient way to implement this is to have the VMM write-protect the range of memory where the primary structure resides. Any attempt by the guest to write to the primary structure will then result in a write protection fault (page fault) that the VMM can catch. The VMM then temporarily lifts the write protection, executes the write (perhaps using a single-stepping facility), reestablishes the write-protection, and finally propagates the modification from the primary structure to the shadow structure. The guest then resumes operation in the new context established by the modification of the primary structure.
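Merely by way of illustration, the four steps just described (lift the protection, execute the write, reestablish the protection, propagate the modification) may be sketched in Python as follows. The TracedMemory class and its methods are hypothetical; memory is modeled as a list of cells and the page fault as an ordinary method call, so the sketch shows the ordering of the steps rather than any hardware mechanism.

```python
class TracedMemory:
    """Toy guest memory in which selected cells are write-protected by the VMM."""

    def __init__(self, size):
        self.cells = [0] * size
        self.write_protected = set()  # addresses currently write-protected
        self.handlers = {}            # address -> callback that updates the shadow

    def add_write_trace(self, addresses, on_write):
        """Write-protect the primary structure and register the propagation callback."""
        for addr in addresses:
            self.write_protected.add(addr)
            self.handlers[addr] = on_write

    def guest_write(self, addr, value):
        """A write issued by the guest; protected cells fault into the VMM."""
        if addr in self.write_protected:
            self._handle_write_fault(addr, value)
        else:
            self.cells[addr] = value

    def _handle_write_fault(self, addr, value):
        self.write_protected.discard(addr)  # 1. temporarily lift the protection
        self.cells[addr] = value            # 2. execute the guest's write
        self.write_protected.add(addr)      # 3. reestablish the protection
        self.handlers[addr](addr, value)    # 4. propagate to the shadow structure
```

In use, a VMM-side dictionary could serve as the shadow structure, with the callback copying (or restricting) each value written to the primary range; writes outside the traced range proceed without VMM involvement.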
Abstractly, one may say that the VMM establishes a “write trace” on the primary structure. The write trace provides notification to the VMM whenever the guest attempts to modify the primary structure. This in turn gives the VMM the opportunity to control the modification to the primary structure, and rederive the shadow structure from the primary. In other situations, primarily involving memory-mapped devices, the VMM may use “read traces” to get notification whenever the guest reads a memory location.
To further illustrate the utility of traces, consider the case in which a VMM uses binary translation to execute some of the guest instruction stream. Using the terminology introduced above, the guest instruction stream that is given as input to the binary translator is a primary structure, from which the translator derives a shadow, that is, secondary structure (the translated code). As with any primary/shadow arrangement, if the guest modifies the primary structure (that is, the guest uses self-modifying code), then the VMM must retranslate or invalidate the shadow structure. A convenient way to trigger retranslation or invalidation is for the VMM to apply a write trace to all guest code that has been processed by the binary translator. Any “write” to any memory location containing the guest code will then give rise to a trace fault being issued to the VMM. The VMM then handles this fault by retranslating or invalidating the (possibly) altered code.
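Merely by way of illustration, invalidation of translated code by means of a write trace may be sketched in Python as follows. The TranslationCache class is hypothetical, and the "translation" is a placeholder string rather than actual binary translation; the sketch shows only that a write to traced guest code discards the stale shadow so that the next execution retranslates it.

```python
class TranslationCache:
    """Toy translation cache keyed by guest code address."""

    def __init__(self, guest_code):
        self.guest_code = list(guest_code)  # primary structure (guest instructions)
        self.translated = {}                # shadow structure: address -> translation
        self.traced = set()                 # addresses under a write trace

    def translate(self, addr):
        """Return the translation for addr, translating and tracing on first use."""
        if addr not in self.translated:
            # Placeholder for the real translator.
            self.translated[addr] = "T(" + self.guest_code[addr] + ")"
            self.traced.add(addr)
        return self.translated[addr]

    def guest_write(self, addr, new_instr):
        """Guest modifies its own code; a traced address triggers invalidation."""
        self.guest_code[addr] = new_instr
        if addr in self.traced:
            # Trace fault: the shadow is (possibly) stale, so discard it.
            self.translated.pop(addr, None)
            self.traced.discard(addr)
```

The sketch invalidates rather than eagerly retranslates, which defers the cost of translation until the modified code is actually executed again; either choice is consistent with the description above.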
From all of this it may be understood that the use of read and write traces to monitor guest access to primary data and code structures is central to the implementation of uniprocessor VMMs. The manner in which different types of traces are established in virtualized, uniprocessor systems, the underlying data structures that make tracing possible, and the concept of sensing faults in response to trace events, are in general well known in computer science. What is needed, however, is a system and method of operation that provides a generalization of traces in order to facilitate the implementation of multiprocessor VMMs. This invention provides such a system and method.