1. Field of the Invention
This invention relates to the field of computer virtualization, that is, to systems and methods for implementing computers as software running on an underlying host hardware platform.
2. Description of the Related Art
Virtualization has brought many advantages to the world of computers. As is well known in the art, a virtual machine (VM) is a software abstraction—a “virtualization”—of a physical computer system that runs as a “guest” on an underlying “host” hardware platform. As long as a suitable interface is provided between the VM and the host platform, one advantage is that the operating system (OS) in the guest need not be the same as the OS at the system level in the host. For example, applications that presuppose a Microsoft Windows OS can be run in the VM even though the OS used to handle actual I/O, memory management, etc., on the host might be Linux.
It usually requires less than 10% of the processing capacity of a CPU to run a typical application, although usage may peak briefly for certain operations. Virtualization can more efficiently use processing capacity by allowing more than one VM to run on a single host, effectively multiplying the number of “computers” per “box.” Depending on the implementation, the reduction in performance is negligible, or at least not enough to justify separate, dedicated hardware “boxes” for each user.
Still another advantage is that different VMs can be isolated from and completely transparent to one another. Indeed, the user of a single VM will normally be unaware that he is not using a “real” computer, that is, a system with hardware dedicated exclusively to his use. The existence of the underlying host will also be transparent to the VM software itself. The products of VMware, Inc., of Palo Alto, Calif. provide all of these advantages in that they allow multiple, isolated VMs, which may (but need not) have OSs different from each other's, to run on a common hardware platform.
Example of a Virtualized System
FIG. 1 illustrates the main components of a system 10 that supports a virtual machine as implemented in the Workstation product of VMware, Inc. As in conventional computer systems, both system hardware 100 and system software 220 are included. The system hardware 100 includes CPU(s) 102, which may be a single processor, or two or more cooperating processors in a known multiprocessor arrangement. The system hardware also includes system memory 104, one or more disks 106, and some form of memory management unit (MMU) 108. As is well understood in the field of computer engineering, the system hardware also includes, or is connected to, conventional registers, interrupt-handling circuitry, a clock, etc., which, for the sake of simplicity, are not shown in the figure.
The system software 220 either is or at least includes an operating system (OS) 230, which has drivers 240 as needed for controlling and communicating with various devices 110, and usually with the disk 106 as well. Conventional applications 255, if included, may be installed to run on the hardware 100 via the system software 220 and any drivers needed to enable communication with devices.
As mentioned above, the virtual machine (VM) 300—also known as a “virtual computer”—is a software implementation of a complete computer system. In the VM, the physical system components of a “real” computer are emulated in software, that is, they are virtualized. Thus, the VM 300 will typically include virtualized (“guest”) system hardware 301, which in turn includes one or more virtual CPUs 302 (VCPU), virtual system memory 304 (VMEM), one or more virtual disks 306 (VDISK), and one or more virtual devices 310 (VDEVICE), all of which are implemented in software to emulate the corresponding components of an actual computer.
The VM's system software 320 includes a guest operating system 330, which may, but need not, simply be a copy of a conventional, commodity OS, as well as drivers 340 (DRVS) as needed, for example, to control the virtual device(s) 310. Of course, most computers are intended to run various applications, and a VM is usually no exception. Consequently, by way of example, FIG. 1 illustrates one or more applications 355 installed to run on the guest OS 330; any number of applications, including none at all, may be loaded for running on the guest OS, limited only by the requirements of the VM. The guest system software 320, including the guest OS 330, and the guest applications 355 are referred to collectively herein as the “guest software.”
Note that although the hardware “layer” 301 will be a software abstraction of physical components, the VM's system software 320 may be the same as would be loaded into a hardware computer. The modifier “guest” is used here to indicate that the VM, although it acts as a “real” computer from the perspective of a user, is actually just computer code that is executed on the underlying “host” hardware and software platform 100, 220. Thus, for example, I/O to the virtual device 310 will actually be carried out by I/O to the hardware device 110, but in a manner transparent to the VM.
If the VM is properly designed, then the applications (or the user of the applications) will preferably not “know” that they are not running directly on “real” hardware. Of course, all of the applications and the components of the VM are instructions and data stored in memory, just as any other software. The concept, design and operation of virtual machines are well known in the field of computer science. FIG. 1 illustrates a single VM 300 merely for the sake of simplicity, to illustrate the structure and operation of that single VM; in many installations, there will be more than one VM installed to run on the common hardware platform; all will typically have essentially the same general structure, although the individual components need not be identical.
Some interface is usually required between the VM 300 and the underlying “host” hardware 100, which is responsible for actually executing VM-related instructions and transferring data to and from the actual, physical memory 104. One advantageous interface between the VM and the underlying host system is often referred to as a virtual machine monitor (VMM), also known as a virtual machine “manager.” Virtual machine monitors have a long history, dating back to mainframe computer systems in the 1960s. See, for example, Robert P. Goldberg, “Survey of Virtual Machine Research,” IEEE Computer, June 1974, p. 34-45.
A VMM is usually a relatively thin layer of software that runs directly on top of a host, such as the system software 220, or directly on the hardware, and virtualizes the resources of the (or some) hardware platform. The VMM will typically include at least one device emulator 410, which may also form the implementation of the virtual device 310. The interface exported to the respective VM is usually such that the guest OS 330 cannot determine the presence of the VMM. The VMM also usually tracks and either forwards (to the host OS 230) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts. FIG. 1 therefore illustrates an interrupt (including fault) handler 450 within the VMM. The general features of VMMs are well known and are therefore not discussed in further detail here.
In FIG. 1, a single VMM 400 is shown acting as the interface for the single VM 300. It would also be possible to include the VMM as part of its respective VM, that is, in each virtual system. Although the VMM is usually completely transparent to the VM, the VM and VMM may be viewed as a single module that virtualizes a computer system. The VM and VMM are shown as separate software entities in the figures for the sake of clarity. Moreover, it would also be possible to use a single VMM to act as the interface for more than one VM, although it will in many cases be more difficult to switch between the different contexts of the various VMs (for example, if different VMs use different guest operating systems) than it is simply to include a separate VMM for each VM. This invention works with all such VM/VMM configurations.
In some configurations, the VMM 400 runs as a software layer between the host system software 220 and the VM 300. In other configurations, such as the one illustrated in FIG. 1, the VMM runs directly on the hardware platform 100 at the same system level as the host OS. In such case, the VMM may use the host OS to perform certain functions, including I/O, by calling (usually through a host API—application program interface) the host drivers 240. In this situation, it is still possible to view the VMM as an additional software layer inserted between the hardware 100 and the guest OS 330. Furthermore, it may in some cases be beneficial to deploy VMMs on top of a thin software layer, a “kernel,” constructed specifically for this purpose.
FIG. 2 illustrates yet another implementation, in which a kernel 650 takes the place of and performs the conventional functions of the host OS. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple virtual machines (for example, resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs.
As used herein, the “host” OS therefore means either the native OS 230 of the underlying physical computer, or whatever system-level software handles actual I/O operations, takes faults and interrupts, etc. for the VM. The invention may be used in all the different configurations described above.
Memory Mapping and Address Terminology
In most modern computers, memory is addressed as units known as “pages,” each of which is identified by a corresponding page number. The most straightforward way for all components in a computer to uniquely identify a memory page would be for them all simply to use a common set of page numbers. This is almost never done, however, for many well-known reasons. Instead, user-level software normally refers to memory pages using one set of identifiers, which is then ultimately mapped to the set actually used by the underlying hardware memory.
When a subsystem requests access to the hardware memory 104, for example, the request is usually issued with a “virtual address,” since the memory space that the subsystem addresses is a construct adopted to allow for much greater generality and flexibility. The request must, however, ultimately be mapped to an address that is issued to the actual hardware memory. This mapping, or translation, is typically specified by the operating system (OS), which includes some form of memory management module 208 included for this purpose. The OS thus converts the “virtual” address (VA), in particular, the virtual page number (VPN) of the request, into a “physical” address (PA), in particular, a physical page number (PPN), that can be applied directly to the hardware. (The VA and PA have a common offset from a base address, so that only the VPN needs to be converted into a corresponding PPN.)
When writing a given word to a virtual address in memory, the processor breaks the virtual address into a virtual page number (higher-order address bits) plus an offset into that page (lower-order address bits). The virtual page number (VPN) is then translated using mappings established by the OS into a physical page number (PPN) based on a page table entry (PTE) for that VPN in the page table associated with the currently active address space. The page table will therefore generally include an entry for every VPN. The actual translation may be accomplished simply by replacing the VPN (the higher order bits of the virtual address) with its PPN mapping, leaving the lower order offset bits the same.
To speed up virtual-to-physical address translation, a hardware structure known as a translation look-aside buffer (TLB) is normally included, for example, as part of a hardware memory management unit (MMU) 108. The TLB contains, among other information, VPN-to-PPN mapping entries at least for VPNs that have been addressed recently or frequently. Rather than searching the entire page table, the TLB is searched first instead. If the current VPN is not found in the TLB, then a “TLB miss” occurs, and the page tables in memory are consulted to find the proper translation, and the TLB is updated to include this translation. After the TLB miss fault is handled, the same memory access is attempted again, and this time, the required VPN-to-PPN mapping is found in the TLB. The OS thus specifies the mapping, but the hardware MMU 108 usually actually performs the conversion of one type of page number to the other. Below, for the sake of simplicity, when it is stated that a software module “maps” page numbers, the existence and operation of a hardware device such as the MMU 108 may be assumed.
The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as “Computer Organization and Design: The Hardware/Software Interface,” by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (chapter 7.4 “Virtual Memory”). Patterson and Hennessy analogize address translation to finding a book in a library. The VPN is the “title” of the book and the full card catalog is the page table. A catalog card is included for every book in the library and tells the searcher where the book can be found. The TLB is then the “scratch” paper on which the searcher writes down the locations of the specific books he has previously looked up.
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 355 in the VM 300 is remapped twice in order to determine which page of the hardware memory is intended. A mapping module 308 within the guest OS 330 translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner. The guest OS therefore “believes” that it is directly addressing the actual hardware memory, but in fact it is not. Of course, a valid address to the actual hardware memory address must, however, ultimately be used.
An address mapping module 408 in the VMM 400 therefore takes the GPPN issued by the guest OS 330 and maps it to a hardware page number PPN that can be used to address the hardware memory. From the perspective of the guest OS, the GVPN and GPPN are virtual and physical page numbers just as they would be if the guest OS were the only OS in the system. From the perspective of the actual host OS, however, the GPPN is a page number in the virtual address space, that is, a VPN, which is then mapped into the physical memory space of the hardware memory as a PPN. Note that in some literature involving virtualized systems, GVPNs, GPPNs, VPNs and PPNs are sometimes referred to as “VPNs,” “PPNs,” “VPNs” and “MPNs,” respectively, where “MPN” means “machine page number,” that is, the page number used to address the hardware memory. The problem is, though, that “VPN” is then used to mean the virtual page number in both the guest and host contexts, and one must always be aware of the current context to avoid confusion. Regardless of notation, however, the intermediate GPPN→PPN mapping performed by the VMM is transparent to the guest system.
Speed is a critical issue in virtualization—a VM that perfectly emulates the functions of a given computer but that is too slow to perform needed tasks is obviously of little good to a user. Ideally, a VM should operate at the native speed of the underlying host system. In practice, even where only a single VM is installed on the host, it is impossible to run a VM at native speed, if for no other reason than that the instructions that define the VMM must also be executed. Near native speed, is possible, however, in many common applications.
The highest speed for a VM is found in the special case where every VM instruction executes directly on the hardware processor. This would in general not be a good idea, however, because the VM should not be allowed to operate at the greatest privilege level; otherwise, it might alter the instructions or data of the host OS or the VMM itself and cause unpredictable behavior. Moreover, in cross-architectural systems, one or more instructions issued by the VM may not be included in the instruction set of the host processor. Instructions that cannot (or must not) execute directly on the host are typically converted into an instruction stream that can. This conversion process is commonly known as “binary translation.”
U.S. Pat. No. 6,397,242 (Devine, et al., “Virtualization system including a virtual machine monitor for a computer with a segmented architecture”), which is incorporated herein by reference, describes a system in which the VMM includes a mechanism that allows VM instructions to execute directly on the hardware platform whenever possible, but that switches to binary translation when necessary. FIG. 1 shows a direct execution engine 403 and a binary translation engine 405, for performing these respective functions. Combining these techniques allows for the speed of direct execution combined with the security of binary translation.
A virtualization system of course involves more than executing VM instructions—the VMM itself is also a software mechanism defined by instructions and data of its own. For example, the VMM might be a program written in C, compiled to execute on the system hardware platform. At the same time, an application 355 written in a language such as Visual Basic might be running in the VM, whose guest OS may be compiled from a different language.
There must also be some way for the VM to access hardware devices, albeit in a manner transparent to the VM itself. One solution would of course be to include in the VMM all the required drivers and functionality normally found in the host OS 230 to accomplish I/O tasks. Two disadvantages of this solution are increased VMM complexity and duplicated effort—if a new device is added, then its driver would need to be loaded into both the host OS and the VMM. In systems that include a host OS (as opposed to a dedicated kernel such as shown in FIG. 2), a much more efficient method has been implemented by VMware, Inc., in its Workstation product. This method is also illustrated in FIG. 1.
In the system illustrated in FIG. 1, both the host OS and the VMM are installed at system level, meaning that they both run at the greatest privilege level and can therefore independently modify the state of the hardware processor(s). For I/O to at least some devices, however, the VMM may issue requests via the host OS 230. To make this possible, a special driver VMdrv 242 is installed as any other driver within the host OS 230 and exposes a standard API to a user-level application VMapp 500. When the system is in the VMM context, meaning that the VMM is taking exceptions, handling interrupts, etc., but the VMM wishes to use the existing I/O facilities of the host OS, the VMM calls the driver VMdrv 242, which then issues calls to the application VMapp 500, which then carries out the I/O request by calling the appropriate routine in the host OS.
In FIG. 1, the vertical line 600 symbolizes the boundary between the virtualized (VM/VMM) and non-virtualized (host software) “worlds” or “contexts.” The driver VMdrv 242 and application VMapp 500 thus enable communication between the worlds even though the virtualized world is essentially transparent to the host system software 220.
Timers in Virtual Computer Systems
A typical physical computer system contains multiple timers. Thus, FIG. 1 shows a plurality of timers 160. The configuration, operation and use of timers may vary from system to system. In a typical example of the use of a timer, however, the timer may be configured to count from zero up to a specific, final value, counting at a set frequency. When the timer reaches the final value, it may generate a timer interrupt to notify a software routine that it has finished its count, or it may notify a software routine in some other manner. Such a timer interrupt or other notification is referred to as a timer event. The timer may be further configured to reset to zero after it has reached the final value, and begin counting again up to the same final value, implementing a periodic timer. Alternatively, the timer may be configured to simply stop once it reaches the final value, implementing a one-shot timer.
One common use for a timer is to keep track of the time of day or to keep track of how much time has elapsed since some point in time. For such a task, the timer may be set up as a periodic timer; generating timer interrupts at a constant frequency. The interrupts generated by the timer may be counted to keep track of the time with a resolution that is equal to the period of the timer. To further refine the time-keeping function, the timer may also be read from time to time. At an instant when the timer is read, the time may be determined more precisely by adding the time represented by the count of the timer to the time represented by the number of timer interrupts that have occurred. Using this technique, the resolution of the time keeping-function is limited only by the frequency at which the timer is set up to count.
A VM preferably also includes a plurality of timers. Thus, FIG. 1 also shows a plurality of virtual timers (VTIMERS) 360 within the VM 300. The hardware platform of the VM 300, which is exported by the VMM 400, may be a specific platform that is also implemented in physical computer systems. For example, the hardware platform of the VM 300 may be a standard PC (personal computer) platform based on a microprocessor having the Intel IA-32 architecture. When the hardware platform of the VM 300 matches a common physical hardware platform, the virtual timers 360 exported to the VM 300 preferably function substantially the same as the timers in a physical implementation of the hardware platform.
Each virtual timer 360 within the VM 300 may be virtualized by a separate timer emulator, or a single timer emulator may virtualize multiple virtual timers 360. Thus, FIG. 1 also shows a plurality of timer emulators 460 for virtualizing (or emulating or exporting) the virtual timers 360. Each timer emulator 460 preferably emulates its respective virtual timer(s) 360 in a manner such that the guest software in the VM 300 and/or a user of the VM 300 cannot tell that the virtual timer 360 is not a “real” timer, implemented directly in hardware in a physical implementation of the hardware platform. Thus, for example, a virtual timer 360 may preferably also be configured to count from zero up to a specific, final value, counting at a set frequency, either as a periodic timer or as a one-shot timer, and to generate a timer interrupt or other timer event upon reaching the final value. The guest software may preferably use the virtual timers 360 in the same manner that it would use actual hardware timers in a physical implementation of the virtualized hardware platform.
In a physical computer system, the operation of the multiple timers is generally accurate and consistent, and the software executing on physical computer systems generally relies on the accuracy and consistency of these timers. In some situations, the software executing on a physical computer system may rely on the consistency of multiple timers with respect to each other, so that if the timing of one timer is not consistent with the timing of another timer, the software may not function properly. The software creates a dependency between multiple timers and the software expects this dependency to be met. As an example, a first timer may be set to generate timer events at a first frequency and a second timer may be set to generate timer events at a second frequency, with the second frequency being twice the first frequency. If the second timer does not generate timer events twice as often as the first timer, the software may not function properly; or if the timer events of the two timers are inconsistent in some other manner, the software may not function properly.
The VM 300 is preferably designed to execute much of the same software that can execute on a corresponding physical computer system. Thus, the guest software in the VM 300 may also use the virtual timers 360 in a manner that gives rise to a dependency between the multiple virtual timers 360. In this case, there must be some measure of consistency between the multiple virtual timers 360, or the guest software may not function properly. In particular, the guest software may not function in the same manner as the software would function if it were executed in a physical computer system implementing the same hardware platform. For example, if the timer events generated by the virtual timers 360 do not occur in the same relative order within the VM 300 as they would occur if the guest software were running on a physical computer system, the guest software may not run correctly in the VM 300. Also, if the relative intervals between the timer events generated by the virtual timers 360 do not approximate the relative intervals that would occur in a physical computer system, the guest software again may not run correctly.
As one particular example, suppose that the VM 300 implements a common symmetric multiprocessing (SMP) hardware platform, and that the guest OS 330 is a version of Linux that executes on a SMP computer. SMP Linux is critically dependent on a certain level of consistency between the multiple timers in a computer system. If this level of consistency is not met, then the computer system will not operate properly.
What is needed, therefore, is a method and apparatus for emulating multiple timers in a virtual computer system with sufficient consistency to enable guest software to execute properly, preferably in the same manner as the guest software would execute on a comparable physical computer system.