1. Field of the Invention
This invention relates generally to computer virtualization and, in particular, to a method and system for efficiently virtualizing completions of input/output events by a virtual device.
2. Description of the Related Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete,” isolated computer. Depending on how it is implemented, virtualization also provides greater security, since the virtualization can isolate potentially unstable or unsafe software so that it cannot adversely affect the hardware state or system files required for running the physical (as opposed to virtual) hardware.
As is well known in the field of computer science, a virtual machine (VM) is an abstraction—a “virtualization”—of an actual physical computer system. FIG. 1 shows one possible arrangement of a computer system 700 that implements virtualization. A virtual machine (VM), which in this system is the guest 200, is installed on a “host platform,” or simply “host,” which will include a hardware platform 100 and one or more layers or co-resident components comprising system-level software, such as an operating system or similar kernel, or a virtual machine monitor or hypervisor (see below), or some combination of these.
Each VM 200 will typically have both virtual system hardware and guest system software. The virtual system hardware typically includes at least one virtual CPU 210, virtual system memory 230, at least one virtual disk 240, and one or more virtual devices 270. Note that a disk—virtual or physical—is also a “device,” but is usually considered separately because of the important role of the disk. All of the virtual hardware components of the VM may be implemented in software using known techniques to emulate the corresponding physical components. The guest system software includes a guest operating system (OS) 220 and drivers 224 as needed for the various virtual devices 270.
If the VM 200 is properly designed, applications 260 running on the VM will function the same as they would if run on a “real” computer, even though the applications are running indirectly, that is via the guest OS 220 and virtual processor(s) 210. Executable files will be accessed by the guest OS from the virtual disk 240 or virtual memory 230, which will simply be portions of the actual physical disk 140 or memory 130 allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if the files had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines are well known in the field of computer science.
Some interface is usually required between a VM 200 and the underlying host platform (in particular, the hardware CPU(s) 110), which is responsible for actually executing VM-issued instructions and transferring data to and from the hardware memory 130 and storage devices 140. A common term for this interface is a “virtual machine monitor” (VMM), shown as component 300. A VMM is usually a software component that runs directly on top of a host, or directly on the hardware, and virtualizes at least some of the resources of the physical host machine so as to export some hardware interface to the VM.
The various virtualized hardware components in the VM, such as the virtual CPU(s) 210, the virtual memory 230, the virtual disk 240, and the virtual device(s) 270, are shown as being part of the VM 200 for the sake of conceptual simplicity. In actuality, these “components” are usually implemented as emulations included in the VMM. One advantage of such an arrangement is that the VMM may be set up to expose “generic” devices, which facilitate VM migration and hardware platform independence.
In fully virtualized systems, the guest OS 220 cannot determine the presence of the VMM 300 and does not access hardware devices directly. One advantage of full virtualization is that the guest OS may then often simply be a copy of a conventional operating system. Another advantage is that the system provides complete isolation of a VM 200 from other software entities in the system (in particular, from other VMs), if desired. Because such a VM (and thus the user of applications running in the VM) cannot usually detect the presence of the VMM, the VMM and the VM may be viewed as together forming a single virtual computer.
In contrast to a fully virtualized system, the guest OS in a so-called “paravirtualized” system is modified to support virtualization, such that it not only has an explicit interface to the VMM, but is sometimes also allowed to access at least one host hardware resource directly. In short, virtualization transparency is sacrificed to gain speed or to make it easier to implement the VMM that supports the paravirtualized machine. In such paravirtualized systems, the VMM is sometimes referred to as a “hypervisor.”
In addition to the distinction between full and partial (para-) virtualization, two arrangements of intermediate system-level software layer(s) are in general use—a “hosted” configuration and a non-hosted configuration (which is shown in FIG. 1). In a hosted virtualized computer system, an existing, general-purpose operating system forms a “host” OS that is used to perform certain input/output (I/O) operations, alongside and sometimes at the request of the VMM. The Workstation product of VMware, Inc., of Palo Alto, Calif., is an example of a hosted, virtualized computer system, which is also explained in U.S. Pat. No. 6,496,847 (Bugnion, et al., “System and Method for Virtualizing Computer Systems,” 17 Dec. 2002).
In many cases, it may be beneficial to deploy VMMs on top of a software layer, a kernel 600, constructed specifically for this purpose. This configuration is frequently referred to as being “non-hosted.” Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services (for example, resource management) that extend across multiple virtual machines. Compared with a hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting primarily of VMMs. The kernel 600 also handles any other applications running on the kernel that can be separately scheduled, as well as any temporary “console” operating system 420, if included, used for booting the system as a whole and for enabling certain user interactions with the kernel.
Note that the kernel 600 is not the same as the kernel that will be within the guest operating system 220—as is well known, every operating system has its own kernel. Note also that the kernel 600 is part of the “host” platform of the VM/VMM as defined above even though the configuration shown in FIGS. 1 and 2 is commonly termed “non-hosted.” The difference in terminology is one of perspective and definitions that have evolved in the art of virtualization.
Performance Considerations with Device Emulation
In a VM environment, such as within the VM 200 of FIG. 1, the functionality of a physical device is emulated. The device emulation may or may not rely on the “real” presence of the corresponding physical device within the hardware system 100. For example, the VMM 300 of FIG. 1 may include a device emulator 330 that emulates a standard Small Computer System Interface (SCSI) disk such that the virtual disk 240 appears to the VM 200 to be a standard SCSI disk connected to a standard SCSI adapter, although the underlying actual device 140 is another type of physical hardware, such as an IDE hard disk. In this case, a standard SCSI driver is installed into the guest operating system 220 as one of the drivers 224. The device emulator 330 then interfaces with the driver 224 and handles disk operations for the VM 200. The device emulator 330 converts the disk operations from the VM to corresponding disk operations for the physical disk 140. As other examples, CDROM emulation or floppy drive emulation may manipulate image files that are stored on the physical disk 140.
The drivers 224 within the VM 200 will usually be subject to the same limitations as the drivers within a computer system that does not employ virtualization. For example, device drivers for commercial operating systems are often developed under severe market pressure. The purpose of the driver software is to drive target hardware devices installed in the common computer systems available at the time of driver development. Most drivers perform their tasks well, but typically lack any “unnecessary” generalities, so that the device typically cannot perform significantly faster than the initial target speed. In fact, the device may not perform correctly when running on an unusually fast or unusually slow processor. Therefore, when a faster hardware device or faster processor becomes available, it is shipped with a new driver version.
To release CPU resources during the performance of I/O events, most peripheral devices have evolved to operate in an asynchronous interrupt-driven fashion. A processor thus initiates an I/O operation by writing device-specific values into device registers. In FIG. 1, for example, a CPU 110 will initiate an I/O operation for a peripheral device 170 by writing the device-specific values into the device registers. The processor can then operate with respect to other tasks that do not depend on the completion of the pending I/O operation. The device 170 then carries out the specific task and interrupts the processor upon reaching I/O completion. Then, the device driver again takes over and performs the necessary post-completion software tasks. The operating system eventually arranges for the requesting process to obtain the newly received data. The same sequence of events will occur within the VM 200, where the virtual CPU(s) 210 cooperate with the corresponding driver 224 and the guest operating system 220 to control virtual I/O events for a virtual device 270.
A driver 224 of a device 270 implements a state machine to drive the underlying device. State transitions of the state machine rely on the physical characteristics of the device. Both device latencies and throughputs must be modeled accurately in order to ensure reliable operation of the driver in the guest OS 220. If virtual hardware exhibits vastly different timing characteristics than the physical hardware for which the driver was developed and tested, the driver may malfunction, which would affect the reliability of the entire guest OS. For instance, an access time of an ATAPI IDE CDROM may be 75 milliseconds (ms) to 100 ms. The CDROM emulation may, however, actually be accessing data stored on a device such as a hard drive (disk 140) with an access time ten times faster than that of the emulated device. (In these cases, the device that actually stores the data is said to “back” the virtual device.) The original (physical) CDROM device parameters must then be preserved to ensure the correct operation of the CDROM driver in the guest operating system.
FIG. 4 illustrates this point: Assume the same example, namely, that the VM is configured with a virtual CDROM device 270, which will correspond to (that is, be a virtualization of) some physical CDROM device 170. When the VM makes an I/O request directed to the virtual CDROM device, it is in fact requesting some data stored on (or to be written to) what it believes is a “real” CDROM disk 171, but is in fact simply a software construct corresponding to a physical CDROM disk. As long as the VM gets the data that would have been on the physical CDROM disk 171, in the expected format, then the VM does not need to know that the data in fact is coming from the backing device, which, in the illustration, is the hardware disk 140. Consequently, the data assumed to be contained on the physical CDROM disk 171 need not be on a medium in a physical drive at all; in fact, it is not even necessary for the physical CDROM disk 171 ever to have existed, as long as the data such a CDROM disk would hold is stored and made available to the VM in the correct form. Rather, the data on the assumed physical CDROM disk can be stored, for example, as an ISO image file 141, in the backing device, such as disk 140, or possibly even cached in memory.
When the VM issues an I/O request to the virtual CDROM device 270, the VMM, in cooperation with the host, will convert this I/O request into a corresponding I/O request against the backing device. For example, an I/O request issued to the virtual CDROM drive 270 could be converted into a file request against the ISO image file 141 on the physical hard disk 140. Now assume that an I/O request takes 80 ms to complete using the assumed physical CDROM drive 170. The guest OS will then assume that any I/O request it issues to the virtual CDROM drive 270 will also take 80 ms to complete. Assume, however, that the access time for requested information in the ISO image file 141 on the physical hard disk (or other backing device) is only 8 ms.
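The conversion of a virtual CDROM request into a file request can be illustrated with a minimal sketch. It assumes a 2048-byte data sector, as is typical for ISO 9660 images, and the function name `read_virtual_cdrom` and the in-memory image are hypothetical stand-ins used only for illustration; they are not taken from the system described here.

```python
import io

SECTOR_SIZE = 2048  # assumption: data sector size of a typical ISO 9660 image


def read_virtual_cdrom(image, lba, count):
    """Serve a guest read of `count` sectors starting at logical block `lba`
    from a file-like ISO image rather than from a physical disc."""
    image.seek(lba * SECTOR_SIZE)           # sector address -> byte offset in the file
    data = image.read(count * SECTOR_SIZE)
    if len(data) != count * SECTOR_SIZE:
        raise IOError("read past end of image")
    return data


# Hypothetical stand-in for the ISO image file: three sectors filled with 0, 1, 2.
demo_image = io.BytesIO(b"".join(bytes([i]) * SECTOR_SIZE for i in range(3)))
```

For instance, `read_virtual_cdrom(demo_image, 1, 1)` returns the 2048 bytes of the second sector. The guest sees sector data in the expected format, while the time needed to produce that data is determined by the backing device rather than by the emulated drive.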
Thus, a physical CDROM drive 170 would be relatively slow; the VM thinks that the virtual CDROM drive 270 is equally slow; but the actual backing device used to complete I/O requests is much faster. This means that it would be possible to complete the I/O request in only 8 ms, since there is no actual need to retrieve anything from a physical CDROM disk 171 on a physical CDROM drive 170. The problem is that many drivers cannot accept the I/O completion interrupt that early.
Stated more generally, a backing device that completes an I/O event too soon with respect to the anticipated driver/processor speed may interfere with the ongoing state transitions of the state machine being implemented by the driver. In the common case, this could result in interrupting driver code while it is not ready to handle an incoming device interrupt, which in turn might lead to data corruption and/or driver malfunction. A similar problem exists when consecutive interrupts are delivered too close in time, since the driver may not be able to recover from servicing the previous interrupt when the subsequent interrupt arrives.
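The hazard can be made concrete with a deliberately simplified sketch. The class `ToyDriver` and its three states are hypothetical and do not model any particular driver; the sketch assumes only that a driver needs some time after issuing a request before it is ready to accept the completion interrupt.

```python
class ToyDriver:
    """Hypothetical driver state machine: IDLE -> ISSUED -> WAITING -> IDLE."""

    def __init__(self):
        self.state = "IDLE"
        self.completed = 0

    def issue_request(self):
        # The driver writes the device registers; its own bookkeeping is not yet done.
        assert self.state == "IDLE"
        self.state = "ISSUED"

    def finish_setup(self):
        # Bookkeeping is done; the driver is now ready for the completion interrupt.
        assert self.state == "ISSUED"
        self.state = "WAITING"

    def on_interrupt(self):
        # A real driver might silently corrupt data here; the sketch fails loudly.
        if self.state != "WAITING":
            raise RuntimeError(f"interrupt arrived in state {self.state}")
        self.completed += 1
        self.state = "IDLE"
```

If `on_interrupt()` is called after `issue_request()` but before `finish_setup()`, which corresponds to a backing device that completes faster than the driver anticipates, the sketch raises an error instead of completing the request.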
One solution to the interrupt timing problem in the VM context is to place completion interrupts for the completed I/O events that arrive at the faster hardware speed into a delay queue, which can then be used to impose the speed of the relatively slower emulated device. The slower speed is imposed by draining the delay queue at a rate that is consistent with that of the slower emulated device. For instance, continuing with the example above, while emulating a virtual CDROM access, the data may be read from an IDE hard disk backing device within 8 ms. The corresponding I/O completion event will then stay in the delay queue for at least 72 ms, for a total completion time of at least 80 ms, in order to match the virtual CDROM access time. To avoid spinning, the delay queue may be examined after each timer interrupt interval to identify I/O completion events that can be forwarded to the guest operating system.
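A minimal sketch of such a delay queue follows. The class and method names are hypothetical, times are plain millisecond integers supplied by the caller, and the `drain` method is assumed to be invoked once per timer interrupt; none of these details are prescribed by the technique itself.

```python
import heapq


class DelayQueue:
    """Holds I/O completions until the emulated device would have finished."""

    def __init__(self, emulated_latency_ms):
        self.emulated_latency_ms = emulated_latency_ms
        self._heap = []   # entries are (due_time_ms, sequence, request_id)
        self._seq = 0     # tie-breaker that keeps insertion order for equal due times

    def complete(self, request_id, issued_at_ms, now_ms):
        """Record that the backing device finished `request_id` at `now_ms`."""
        due = max(now_ms, issued_at_ms + self.emulated_latency_ms)
        heapq.heappush(self._heap, (due, self._seq, request_id))
        self._seq += 1

    def drain(self, now_ms):
        """Return the requests whose completion may now be forwarded to the guest."""
        ready = []
        while self._heap and self._heap[0][0] <= now_ms:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

With an emulated latency of 80 ms, a request issued at time 0 and completed by the backing device at 8 ms is not returned by `drain` until the caller passes a time of at least 80 ms, so it waits in the queue for at least 72 ms, matching the example above.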
While the above-described technique operates well for its intended purposes, it is desirable to allow physical device emulation to scale with any advances in the underlying host platform, while maintaining the stability of the guest operating system. It is also desirable to enable virtual devices to outperform the corresponding physical hardware being emulated, if this can be achieved without destabilizing the guest operating system. In other words, existing solutions to this problem essentially force advanced or faster technology to simulate less advanced, slower technology so as not to “outrun” the latter; the greater the difference in latency between the two, the greater the “waste.” For example, the disadvantage of forced delay is particularly apparent in the context of low-latency devices, such as network interface cards (NICs). Such devices may require smaller delays, on the order of microseconds as compared to the approximately 80 millisecond delay of a CDROM.
Because of the finite host timer resolution of typical VM architectures, the actual average minimal delay will be related to the timer interrupt interval. As a result, there may be a negative impact on the performance of the emulated device. For example, the events corresponding to the arrival of network packets will be communicated to the guest with a latency on the order of milliseconds, instead of microseconds. Increasing the host timer resolution to reduce this latency may be difficult in the hosted context and is likely to adversely affect overall performance when the timer interrupt rate becomes too high.
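The effect of timer resolution can be sketched numerically. The following assumes, purely for illustration, that the delay queue is examined only on timer ticks occurring at integer multiples of a fixed interval; the function name and the 1 ms interval used in the example are hypothetical.

```python
def delivery_time_us(due_us, tick_us):
    """Return the first timer tick at or after `due_us`, with ticks at
    multiples of `tick_us` (integer ceiling division, all values in microseconds)."""
    return -(-due_us // tick_us) * tick_us
```

Under this assumption, with a 1000 microsecond timer interval, a network event that could be delivered after 50 microseconds is not delivered until 1000 microseconds, whereas an event due at 80000 microseconds is delivered on time. The added latency is always less than one timer interval, which is negligible relative to an 80 ms CDROM access but dominates a microsecond-scale NIC delay.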