1. Field of the Invention
This invention relates to the field of virtual computers, especially networking in virtualized systems.
2. Background Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete,” isolated computer. Depending on how it is implemented, virtualization also provides greater security since it can isolate potentially unstable or unsafe software so that it cannot adversely affect the hardware state or system files.
Virtual Computers
As is well known in the field of computer science, a virtual machine (VM) is a software abstraction—a “virtualization”—of an actual physical computer system. FIG. 1 illustrates the main components of one type of virtualized computer system. As with any other computer system, a virtualized computer runs on a system hardware platform 100, which includes one or more processors (CPUs) 110, system memory 140, and at least one storage device, which will typically be a disk 114. The system memory 140 will typically be some form of high-speed RAM, whereas the disk (one or more) will typically be a non-volatile (“persistent”) mass storage device. The hardware 100 will also include other conventional mechanisms such as a memory management unit MMU 116, and one or more conventional network connection device(s) such as a network adapter or network interface card 172—“NIC”—for transfer of data between the various components of the system and one or more external systems such as servers 710 via a bus or network 700.
In the system shown in FIG. 1, a system software layer 200 includes a host operating system 220 or some analogous software that performs the hardware-interface, resource-allocating and control functions of an operating system, which will include drivers 222 as needed for various connected devices 400. A display device and input devices such as a keyboard, mouse, trackball, touchpad, etc., (not shown) are usually also included among the devices for obvious purposes. The disk(s) 114 and the NIC(s) 172 are of course also devices, but are shown separately because of their relative importance. The operating system (OS) 220 may be any known OS and will therefore have all typical components. User-level applications 300 may be installed to run on the host operating system 220.
One or more virtual machines 500 are installed to run on the hardware platform 100. The VMs either alone or in combination with respective, supporting virtual machine monitors VMMs (see below), are referred to here as “guests;” only one guest is shown, for simplicity. Two configurations are in general use—a “hosted” configuration, illustrated in FIG. 1, in which an existing, general-purpose operating system (OS) 220 forms a “host” OS that is used to perform certain I/O operations; and a non-hosted configuration, illustrated in FIG. 2, in which a kernel 800 customized to support virtual computers takes the place of the conventional operating system. Of course, the kernel could be considered to be a host, but the configuration is often referred to as being “non-hosted” simply to highlight that the VM and VMMs have specialized system-level support as opposed to relying on existing, stock operating systems. The main components of these two configurations are outlined below.
Each VM 500 will have both virtual system hardware 501 and guest system software 502. The virtual system hardware 501 typically includes at least one virtual CPU 510, virtual system memory 512, at least one virtual disk 514, and one or more virtual devices 540. Where the VM is to communicate via the network, it will also have at least one virtual NIC (572). All of the virtual hardware components of the VM may be implemented in software using known techniques to emulate the corresponding physical components.
The guest system software 502 includes a guest operating system 520, which may simply be a copy of a conventional operating system. As with any other operating system, the guest operating system will have a body of code that performs its core functions; this body of code is typically referred to as the OS “kernel.” Along with the kernel, an operating system such as those in the Windows family will typically expose various features to applications running on it. For example, at least one application programming interface (API) is usually made available to applications so that they can access and communicate with corresponding features and request the operating system to perform certain built-in functions. On the other “side,” drivers are usually installed as needed into the operating system to allow the operating system to correctly communicate with both physical and logical (and thus also virtual) devices. Since the operating system does not “know” what the device is, a driver may also be installed to enable communication between the operating system and other software entities as well; this possibility is exploited in this invention. In FIG. 1, drivers 522 are shown installed in the guest OS 520.
If the VM is suitably designed, then it will not be apparent to the user that any applications 503 running within the VM are running indirectly, that is, via the guest OS 520 and virtual processor(s) 510. Applications 503 running within the VM will act just as they would if run on a “real” computer, except for a decrease in running speed that will be noticeable only in exceptionally time-critical applications. Executable files will be accessed by the guest OS 520 from the virtual disk or virtual memory, which will simply be portions of the actual physical disk or memory allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if they had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines are well known in the field of computer science.
Some interface is usually required between a VM and the underlying host platform (in particular, the CPU(s) 110), which is responsible for actually executing VM-issued instructions and transferring data to and from the actual memory 140 and storage devices 114. A common term for this interface is a “virtual machine monitor” (VMM), shown as component 600. A VMM is usually a thin piece of software that runs directly on top of an intermediate host, or directly on the hardware, and virtualizes at least some of the resources of the physical host machine. The interface exported to the VM is then the same as the hardware interface of the machine (or at least of some machine), so that the guest OS 520 cannot determine the presence of the VMM.
The VMM 600 also usually tracks and either forwards (to some form of operating system) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts. A mechanism known in the art as an interrupt or exception handler 630 is therefore included in the VMM. As is well known, such an interrupt/exception handler normally includes an interrupt descriptor table (IDT), or some similar table, which is typically a data structure that uses information in the interrupt signal to point to an entry address for a set of instructions that are to be executed when the interrupt/exception occurs.
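By way of illustration only, the table lookup performed by such a handler can be sketched as follows. The sketch models the table as a mapping from an interrupt vector number to a handler; the vector numbers, the handler names, and the `dispatch` function are hypothetical and are not taken from any particular VMM.

```python
from typing import Callable, Dict


# Hypothetical handlers. In a real system each table entry would hold an
# entry address for a set of machine instructions, not a Python function.
def handle_page_fault(vector: int) -> str:
    return f"page fault handled (vector {vector})"


def handle_timer(vector: int) -> str:
    return f"timer tick handled (vector {vector})"


def handle_unknown(vector: int) -> str:
    return f"unexpected interrupt (vector {vector})"


# The table: information in the interrupt signal (here, the vector
# number) selects the entry to execute. Vector numbers are illustrative.
IDT: Dict[int, Callable[[int], str]] = {
    14: handle_page_fault,
    32: handle_timer,
}


def dispatch(vector: int) -> str:
    """Look up the handler registered for a vector and run it."""
    handler = IDT.get(vector, handle_unknown)
    return handler(vector)
```

Calling `dispatch(14)` runs the page-fault handler, while a vector with no table entry falls through to `handle_unknown`.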
As mentioned above, depending on how the VM is configured, the VMM may be kept transparent to the VM, and thus also to the user of applications running in the VM. Total transparency of VMM and the underlying supporting components is not usually maintained or even desirable in all virtualized systems, however; rather it may be advantageous in some cases, sometimes known as “para-virtualization” systems, for the guest OS to be provided with an explicit interface to the VMM. In such systems, the VMM is sometimes referred to as a “hypervisor.” This invention, for example, uses a special driver (vmxnet 524, described below) within the guest OS 520 to enable certain features.
The VM and VMM are shown in the figures as separate components for the sake of clarity. Together, each VM/VMM pair may be considered to form a single “virtual computer” which may in turn be considered to be the “guest.” The term “guest” is used here, however, to refer to the VM and its various components, although this choice of terminology is made for convenience and not by way of exclusive definition or limitation. There may be several VM/VMM pairs running on a common host; a single VM/VMM pair 500/600 is shown in FIGS. 1 and 2 for simplicity.
Moreover, the various virtualized hardware components such as the virtual CPU(s) 510, the virtual memory 512, the virtual disk 514, and the virtual device(s) 540 are shown as being part of the VM 500 for the sake of conceptual simplicity—in actual implementations these “components” are usually constructs or emulations exposed to the VM by the VMM, for example, as emulators 640. One advantage of such an arrangement is that the VMM may be set up to expose “generic” devices, which facilitate VM migration and hardware platform-independence.
Hosted Virtual Computers
The configuration illustrated in FIG. 1 is used in the Workstation product of VMware, Inc., of Palo Alto, Calif. In this configuration, the VMM 600 is co-resident at system level with the host operating system 220 such that both the VMM and the host OS 220 can independently modify the state of the host processor. However, the VMM calls into the host OS via a special one of the drivers 222 and a dedicated one of the user-level applications 300 to have the host OS 220 perform certain I/O operations on behalf of the VM. The virtual computer in this configuration is thus hosted in that it runs on an existing host hardware platform 100 together with an existing host OS 220. A hosted virtualization system of the type illustrated in FIG. 1 is described in U.S. Pat. No. 6,496,847 (Bugnion, et al., “System and Method for Virtualizing Computer Systems,” 17 Dec. 2002), which is incorporated here by reference.
Non-hosted Virtual Computers
In other, “non-hosted” virtualized computer systems, a dedicated kernel 800 takes the place of and performs the conventional functions of the host OS, and virtual computers run on the kernel. FIG. 2 illustrates such a configuration, with a kernel 800 that serves as the system software layer for the VM/VMM 500/600 pairs, only one of which is shown for the sake of simplicity. Compared with a system in which VMMs run directly on the hardware platform 100, use of a kernel offers improved performance because it can be co-developed with the VMMs and be optimized for the characteristics of a workload consisting of VMMs (and their supported VMs). Moreover, a kernel can also be optimized for I/O operations and it facilitates provision of services that extend across multiple VMs (for example, for resource management). The ESX Server product of VMware, Inc., has such a configuration.
At boot-up time, an existing operating system 220 (which may be of the same type as the host OS 220 in the configuration of FIG. 1) may be at system level and the kernel 800 may not yet even be operational within the system. In such case, one of the functions of the OS 220 may be to make it possible to load the kernel 800, after which the kernel runs on the native hardware 100 and manages system resources using such components as various loadable modules and drivers 810, a memory management unit 818, at least one interrupt and exception handler 815, etc.
In effect, the kernel, once loaded, displaces the OS 220. Thus, the kernel 800 may be viewed either as displacing the OS 220 from the system level and taking this place itself, or, equivalently, as residing at a “sub-system level.” When interposed between the OS 220 and the hardware 100, the kernel 800 essentially turns the OS 220 into an “application,” which has access to system resources only when allowed by the kernel 800. The kernel then schedules the OS 220 as if it were any other component that needs to use system resources.
The OS 220 may also be included to allow applications 300 unrelated to virtualization to run; for example, a system administrator may need such applications to monitor the hardware 100 or to perform other administrative routines. The OS 220 may thus be viewed as a “console” OS (COS) or “service console.” In such implementations, the kernel 800 preferably also includes a remote procedure call (RPC) mechanism and/or a shared memory area to enable communication, for example, between the VMM 600 and any applications 300 installed to run on the COS 220.
The console OS 220 in FIG. 2 is labeled the same as the host OS 220 in FIG. 1. This is to illustrate that, usually, at most only minor modifications need to be made to the OS 220 kernel in order to support either the hosted or the non-hosted virtualized computers. In fact, at least in the virtualization products of VMware, Inc., “off-the-shelf” commodity operating systems such as Linux and Microsoft Windows may be used as the host or console operating systems.
The kernel 800 handles not only the various VM/VMMs 500/600, but also any other applications running on the kernel, as well as the console OS 220 and even the hardware CPU(s) 110, as entities that can be separately scheduled. Each schedulable entity may be referred to as a “world,” which contains a thread of control, an address space, machine memory, and handles to the various device objects that it is accessing. Worlds, represented in FIG. 2 within the kernel 800 as module 812, are stored in a portion of the memory space controlled by the kernel. Each world also has its own task structure, and usually also a data structure for storing the hardware state currently associated with the respective world.
There will usually be different types of worlds: For example, one or more system worlds may be included, as well as idle worlds, one per CPU. Another world would be a console world associated with the COS 220. Each virtual computer (VM/VMM pair) would also constitute a world.
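As a purely illustrative sketch, the notion of a world as a separately schedulable entity might be modeled with a record such as the one below. The class name, field names, and the `build_worlds` helper are hypothetical and do not reflect the actual data structures of any kernel; system worlds are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class World:
    """One separately schedulable entity (hypothetical field names)."""
    name: str
    kind: str                      # e.g. "idle", "console", or "vm"
    address_space_id: int
    device_handles: List[str] = field(default_factory=list)
    # Hardware state currently associated with this world.
    saved_hardware_state: Dict[str, int] = field(default_factory=dict)


def build_worlds(num_cpus: int, vm_names: List[str]) -> List[World]:
    """Create one idle world per CPU, one console world, one world per VM."""
    worlds = [World(f"idle{cpu}", "idle", address_space_id=cpu)
              for cpu in range(num_cpus)]
    worlds.append(World("console", "console", address_space_id=num_cpus))
    for i, vm in enumerate(vm_names):
        worlds.append(World(vm, "vm", address_space_id=num_cpus + 1 + i))
    return worlds
```

For a two-CPU system with a single virtual computer, `build_worlds(2, ["vm0"])` yields two idle worlds, one console world, and one VM world, each with its own address space identifier.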
Binary Translation Vs. Direct Execution
As is known, for example, from U.S. Pat. No. 6,397,242 (Devine, et al., 28 May 2002), which is incorporated here by reference, some virtualization systems allow VM instructions to run directly (in “direct execution” mode) on the hardware CPU(s) when possible. When necessary, however, VM execution is switched to the technique known as “binary translation,” during which the VM is running in the VMM and the VM instructions are converted—translated—into a different instruction or instruction sequence, for example, to enable execution at a safe privilege level. The VMM 600 is therefore shown in FIG. 1 (and assumed in FIG. 2) with a direct execution engine 610, a binary translator 612, and a translation cache 613 which holds the sequences of translated instructions; the VMM will generally also include these components in non-hosted systems.
In the system described in U.S. Pat. No. 6,397,242, for the sake of speed, VM instructions are normally allowed to execute directly. The privilege level of the VM is, however, set such that the hardware platform does not execute VM instructions that require a greater privilege level than the VM is set at. Instead, attempted execution of such an instruction causes the platform to issue a fault, which the VMM handles in part by switching VM execution to binary translation. Direct execution is then resumed at a safe point in the VM instruction stream. This dual-execution mode feature may be used in both hosted and non-hosted configurations of the virtualized computer system.
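The switching between the two execution modes can be sketched, in greatly simplified form, as follows. The instruction names, the `PRIVILEGED` set, and the `run_vm` function are hypothetical; the sketch also assumes, for simplicity, that direct execution resumes immediately after a single translated instruction, whereas an actual system resumes at whatever point in the instruction stream it determines to be safe.

```python
from typing import List, Tuple

# Illustrative set of instructions that require a greater privilege
# level than the VM is set at; attempting them directly causes a fault.
PRIVILEGED = {"cli", "sti", "lgdt"}


def run_vm(instructions: List[str]) -> List[Tuple[str, str]]:
    """Return (instruction, mode) pairs showing how each one was run."""
    trace = []
    for insn in instructions:
        if insn in PRIVILEGED:
            # Direct execution would fault here; the VMM handles the
            # fault by switching this instruction to binary translation.
            trace.append((insn, "binary translation"))
        else:
            # Unprivileged instructions run directly for speed.
            trace.append((insn, "direct execution"))
    return trace
```

For the stream `["mov", "cli", "add"]`, only `cli` is routed through binary translation; the surrounding instructions execute directly.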
Virtual and Physical Memory
The address space of the memory 140 is usually partitioned into pages, regions, or other analogous allocation units. Applications address the memory 140 using virtual addresses (VAs), each of which typically comprises a virtual page number (VPN) and an offset into the indicated page. The VAs are then mapped to physical addresses (PAs), each of which similarly comprises a physical page number (PPN) and an offset, and which is actually used to address the physical memory 140. The same offset is usually used in both a VA and its corresponding PA, so that only the VPN needs to be converted into a corresponding PPN.
The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as “Computer Organization and Design: The Hardware/Software Interface,” by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (chapter 7.4 “Virtual Memory”). Similar mappings are used in region-based architectures or, indeed, in any architecture where relocatability is possible.
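A minimal sketch of the VA-to-PA conversion described above follows. It assumes 4 KiB pages and represents the page table as a simple dictionary; the names `PAGE_SHIFT`, `translate`, and `page_table` are chosen for illustration only.

```python
from typing import Dict

PAGE_SHIFT = 12                      # assumed 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1  # low bits select the byte in a page


def translate(va: int, page_table: Dict[int, int]) -> int:
    """Convert a virtual address to a physical address."""
    vpn = va >> PAGE_SHIFT           # virtual page number
    offset = va & OFFSET_MASK        # same offset is reused in the PA
    ppn = page_table[vpn]            # KeyError models an unmapped page
    return (ppn << PAGE_SHIFT) | offset
```

With a table that maps VPN 0x1 to PPN 0x7, `translate(0x1234, {0x1: 0x7})` returns 0x7234: only the page number changes, and the offset 0x234 is carried over unchanged.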
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 503 in the VM 500 is remapped twice in order to determine which page of the hardware memory is intended. The first mapping is provided by a mapping module 523 within the guest OS 520, which translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner. The guest OS therefore “believes” that it is directly addressing the actual hardware memory, but in fact it is not.
Of course, a valid address to the actual hardware memory 140 must ultimately be generated. A memory management module 605, located typically in the VMM 600, therefore performs the second mapping by taking the GPPN issued by the guest OS 520 and mapping it to a hardware (or “machine”) page number PPN that can be used to address the hardware memory 140. This GPPN-to-PPN mapping may instead be done in the main system-level software layer (such as in a mapping module in a memory management unit in the kernel 800), depending on the implementation. From the perspective of the guest OS, the GVPN and GPPN might be virtual and physical page numbers just as they would be if the guest OS 520 were the only OS in the system. From the perspective of the system software, however, the GPPN is a page number that is then mapped into the physical memory space of the hardware memory as a PPN.
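The double remapping can be sketched as two successive table lookups. The sketch assumes 4 KiB pages and dictionary-based tables; the function and parameter names are hypothetical and stand in for the mapping module 523 and the memory management module 605, respectively.

```python
from typing import Dict

PAGE_SHIFT = 12                      # assumed 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def guest_va_to_machine_pa(gva: int,
                           guest_page_table: Dict[int, int],
                           vmm_map: Dict[int, int]) -> int:
    """Translate a guest virtual address to a machine address."""
    gvpn = gva >> PAGE_SHIFT
    offset = gva & OFFSET_MASK
    gppn = guest_page_table[gvpn]    # first mapping, done by the guest OS
    ppn = vmm_map[gppn]              # second mapping, done by VMM or kernel
    return (ppn << PAGE_SHIFT) | offset
```

If the guest OS maps GVPN 0x2 to GPPN 0x5 and the VMM maps GPPN 0x5 to PPN 0x9, then guest address 0x2ABC resolves to machine address 0x9ABC.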
The addressable space of the disk(s) 114, and therefore also of the virtual disk(s) 514, is similarly subdivided into separately identifiable portions such as blocks or sectors, tracks, cylinders, etc. In general, applications do not directly address the disk; rather, disk access and organization are tasks reserved to the operating system, which follows some predefined file system structure. When the guest OS 520 wants to write data to the (virtual) disk 514, the identifier used for the intended block, etc., is therefore also converted into an identifier into the address space of the physical disk 114. Conversion may be done within whichever system-level software layer handles the VM: the VMM 600, the host OS 220 (under direction of the VMM), or the kernel 800.
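One simple way to picture this conversion is sketched below. The sketch assumes, purely for illustration, that a virtual disk occupies a single contiguous run of blocks on the physical disk; actual systems may use other layouts, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualDiskExtent:
    """A virtual disk stored as one contiguous run of physical blocks."""
    base_block: int    # first physical block allocated to the virtual disk
    num_blocks: int    # size of the virtual disk, in blocks

    def to_physical(self, virtual_block: int) -> int:
        """Convert a virtual block identifier to a physical one."""
        if not 0 <= virtual_block < self.num_blocks:
            raise ValueError("block lies outside the virtual disk")
        return self.base_block + virtual_block
```

With `VirtualDiskExtent(base_block=1000, num_blocks=50)`, virtual block 7 corresponds to physical block 1007, and a request for virtual block 50 is rejected because it lies outside the space allocated to the VM.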
Problem of Network Performance
One of the most challenging parts of kernel-based virtualization systems such as the ESX product of VMware, Inc. of Palo Alto, Calif., illustrated in simplified form in FIG. 2, is providing good networking performance. Previous work in improving networking performance has focused on a NIC driver in the guest VM 500. This means that all protocol processing is done by the guest OS 520. Unfortunately, there are high virtualization overheads associated with running the guest networking code. This limits the ability to close the performance gap between native networking and virtual machine networking. The most interesting protocol is TCP/IP (Transmission Control Protocol/Internet Protocol) since this is the dominant protocol, although similar problems will generally exist with other protocols as well.
With the introduction of 10 Gigabit Ethernet, NIC manufacturers are providing a TCP/IP offload engine (TOE) on the NIC. This allows an operating system to offload most TCP/IP protocol processing to the NIC. This greatly reduces the CPU overhead associated with this processing.
These TOEs provide an opportunity to greatly improve networking performance: virtualization overhead for networking should be reduced because the protocol stack will be running in hardware. Unfortunately, TOEs are not available today on 100 Mbit or 1 Gigabit cards. In addition, even with a TOE in hardware, there may still be significant virtualization overhead associated with using the TOE because of the work done in the guest OS kernel before it hands data off to the TOE.
What is needed is a software TOE that can be used to improve virtual machine networking performance. This invention meets this need.