1. Field of the Invention
This invention relates to virtualized computer systems, in particular, to a system and method for improving the performance of network transfers to and from a virtual machine.
2. Description of the Related Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete,” isolated computer.
General Virtualized Computer System
As is well known in the field of computer science, a virtual machine (VM) is a software abstraction—a “virtualization”—of an actual physical computer system. FIG. 1 illustrates, in part, the general configuration of a virtual machine 200, which is installed as a “guest” on a “host” hardware platform 100.
As FIG. 1 shows, the hardware platform 100 includes one or more processors (CPUs) 110, system memory 130, and a storage device, which will typically be a disk 140. The system memory will typically be some form of high-speed RAM, whereas the disk (one or more) will typically be a non-volatile, mass storage device. The hardware 100 will also include other conventional mechanisms such as a memory management unit MMU 150, various registers 160, and any conventional network connection device 172 (such as a network adapter or network interface card—“NIC”) for transfer of data between the various components of the system and a bus or network 700, which may be any known public or proprietary bus structure or local or wide-area network such as the Internet, an internal enterprise network, etc.
Each VM 200 will typically include at least one virtual CPU 210, a virtual disk 240, a virtual system memory 230, a guest operating system 220 (which may simply be a copy of a conventional operating system), and various virtual devices 270, in which case the guest operating system (“guest OS”) will include corresponding drivers 224. All of the components of the VM may be implemented in software using known techniques to emulate the corresponding components of an actual computer.
If the VM is properly designed, then it will not be apparent to the user that any applications 260 running within the VM are running indirectly, that is, via the guest OS and virtual processor. Applications 260 running within the VM will act just as they would if run on a “real” computer, except for a decrease in running speed that will be noticeable only in exceptionally time-critical applications. Executable files will be accessed by the guest OS from the virtual disk or virtual memory, which will simply be portions of the actual physical disk or memory allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if they had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines is well known in the field of computer science.
Some interface is usually required between a VM and the underlying host platform (in particular, the CPU), which is responsible for actually executing VM-issued instructions and transferring data to and from the actual memory and storage devices. A common term for this interface is a “virtual machine monitor” (VMM), shown as component 300. A VMM is usually a thin piece of software that runs directly on top of a host, or directly on the hardware, and virtualizes all the resources of the physical host machine. Among other components, the VMM therefore usually includes device emulators 330, which may constitute the virtual devices 270 that the VM 200 addresses. The interface exported to the VM is then the same as the hardware interface of the machine, so that the guest OS cannot determine the presence of the VMM.
The VMM also usually tracks and either forwards (to some form of operating system) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts. A mechanism known in the art as an exception or interrupt handler 355 is therefore included in the VMM. As is well known, such an interrupt/exception handler normally includes an interrupt descriptor table (IDT), or some similar table, which is typically a data structure that uses information in the interrupt signal to point to an entry address for a set of instructions that are to be executed when the interrupt/exception occurs.
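By way of illustration only, the table lookup performed by such a handler can be sketched as follows. The sketch assumes a 256-entry table, as on the x86 architecture; the names `build_idt` and `dispatch` are hypothetical and do not appear in any actual VMM.

```python
IDT_SIZE = 256  # assumption: one entry per vector, as on x86


def build_idt(handlers):
    """Return a table with one handler per vector.

    Vectors without an explicit handler get a default that raises.
    """
    def unhandled(vector):
        raise RuntimeError(f"unhandled vector {vector}")

    idt = [unhandled] * IDT_SIZE
    for vector, handler in handlers.items():
        idt[vector] = handler
    return idt


def dispatch(idt, vector):
    """Use the vector carried by the interrupt signal to select the entry point."""
    return idt[vector](vector)
```

Here the vector number plays the role of the "information in the interrupt signal," and each table entry stands in for the entry address of the handling routine.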
Although the VM (and thus the user of applications running in the VM) cannot usually detect the presence of the VMM, the VMM and the VM may be viewed as together forming a single virtual computer. They are shown in FIG. 1 as separate components for the sake of clarity.
Moreover, the various virtualized hardware components such as the virtual CPU(s) 210, the virtual memory 230, the virtual disk 240, and the virtual device(s) 270 are shown as being part of the VM 200 for the sake of conceptual simplicity—in actual implementations these “components” are usually constructs or emulations exported to the VM by the VMM. For example, FIG. 2 shows a virtual NIC 272 as being within the VM 200. This virtual component, which may be one of the virtual devices 270, may in fact be implemented as one of the device emulators 330 in the VMM. One advantage of such an arrangement is that the VMM may be set up to expose “generic” devices, which facilitate VM migration and hardware platform-independence.
Virtual and Physical Memory
As in most modern computers, the address space of the memory 130 is partitioned into pages (for example, in the Intel x86 architecture), regions (for example, Intel IA-64 architecture) or other analogous units. Applications then address the memory 130 using virtual addresses (VAs), which include virtual page numbers (VPNs). The VAs are then mapped to physical addresses (PAs), which include physical page numbers (PPNs) and are used to address the physical memory 130. (VAs and PAs have a common offset from a base address, so that only the VPN needs to be converted into a corresponding PPN.) The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as “Computer Organization and Design: The Hardware/Software Interface,” by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (chapter 7.4 “Virtual Memory”). Similar mappings are used in region-based architectures or, indeed, in any architecture where relocatability is possible.
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 260 in the VM 200 is remapped twice in order to determine which page of the hardware memory is intended. The first mapping is provided by a mapping module within the guest OS 220, which translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner; because the address offsets are the same, this is the same as translating guest virtual addresses (GVAs) into guest physical addresses (GPAs). The guest OS therefore “believes” that it is directly addressing the actual hardware memory, but in fact it is not.
Of course, a valid address to the actual hardware memory must ultimately be generated. A memory management module 350, located typically in the VMM 300, therefore performs the second mapping by taking the GPPN issued by the guest OS 220 and mapping it to a hardware (or “machine”) page number PPN that can be used to address the hardware memory 130. This GPPN-to-PPN mapping may instead be done in the main system-level software layer (such as in a mapping module 617 in the kernel 600, as illustrated in FIG. 2 and described further below), depending on the implementation. From the perspective of the guest OS, the GVPN and GPPN are virtual and physical page numbers just as they would be if the guest OS were the only OS in the system. From the perspective of the system software, however, the GPPN is a page number that is then mapped into the physical memory space of the hardware memory as a PPN.
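The two mappings can be sketched as follows, purely for illustration. The sketch assumes 4 KiB pages and represents each mapping as a simple dictionary; the function names and table contents are hypothetical, and real systems use hardware-walked page tables rather than dictionaries.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def guest_translate(gva, guest_page_table):
    """First mapping (guest OS): GVPN -> GPPN; the offset is carried over."""
    gvpn, offset = gva >> PAGE_SHIFT, gva & OFFSET_MASK
    return (guest_page_table[gvpn] << PAGE_SHIFT) | offset


def machine_translate(gpa, gppn_to_ppn):
    """Second mapping (VMM or kernel): GPPN -> PPN; the offset is carried over."""
    gppn, offset = gpa >> PAGE_SHIFT, gpa & OFFSET_MASK
    return (gppn_to_ppn[gppn] << PAGE_SHIFT) | offset


def translate(gva, guest_page_table, gppn_to_ppn):
    """Compose both mappings to obtain a machine address."""
    return machine_translate(guest_translate(gva, guest_page_table), gppn_to_ppn)
```

For example, with a guest mapping `{2: 5}` and a second-level mapping `{5: 9}`, the guest virtual address `0x2ABC` becomes the guest physical address `0x5ABC` and then the machine address `0x9ABC`; only the page number changes at each stage.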
System Software Configurations in Virtualized Systems
In some systems, such as the Workstation product of VMware, Inc., of Palo Alto, Calif., the VMM is co-resident at system level with a host operating system. Both the VMM and the host OS can independently modify the state of the host processor, but the VMM calls into the host OS via a driver and a dedicated user-level application to have the host OS perform certain I/O operations on behalf of the VM. The virtual computer in this configuration is thus fully hosted in that it runs on an existing host hardware platform and together with an existing host OS.
In other implementations, a dedicated kernel takes the place of and performs the conventional functions of the host OS, and virtual computers run on the kernel. FIG. 1 illustrates a kernel 600 that serves as the system software for several VM/VMM pairs 200/300, . . . , 200n/300n. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple VMs (for example, for resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs. The ESX Server product of VMware, Inc., has such a configuration. The invention described below takes advantage of the ability to optimize a kernel as a platform for virtual computers.
A kernel-based virtualization system of the type illustrated in FIG. 1 is described in U.S. patent application Ser. No. 09/877,378 (“Computer Configuration for Resource Management in Systems Including a Virtual Machine”), which issued as U.S. Pat. No. 6,961,941 on Nov. 1, 2005, and which is incorporated here by reference. The main components of this system and aspects of their interaction are, however, outlined below.
At boot-up time, an existing operating system 420 may be at system level and the kernel 600 may not yet even be operational within the system. In such a case, one of the functions of the OS 420 may be to make it possible to load the kernel 600, after which the kernel runs on the native hardware 100 and manages system resources. In effect, the kernel, once loaded, displaces the OS 420. Thus, the kernel 600 may be viewed either as displacing the OS 420 from the system level and taking this place itself, or as residing at a “sub-system level.” When interposed between the OS 420 and the hardware 100, the kernel 600 essentially turns the OS 420 into an “application,” which has access to system resources only when allowed by the kernel 600. The kernel then schedules the OS 420 as if it were any other component that needs to use system resources.
The OS 420 may also be included to allow applications unrelated to virtualization to run; for example, a system administrator may need such applications to monitor the hardware 100 or to perform other administrative routines. The OS 420 may thus be viewed as a “console” OS (COS). In such implementations, the kernel 600 preferably also includes a remote procedure call (RPC) mechanism to enable communication between, for example, the VMM 300 and any applications 800 installed to run on the COS 420.
Actions
In kernel-based systems such as the one illustrated in FIG. 1, there must be some way for the kernel 600 to communicate with the VMM 300. In general, the VMM 300 can call into the kernel 600 but the kernel cannot call directly into the VMM. The conventional technique for overcoming this is for the kernel to post “actions” (requests for the VMM to do something) on an action queue 1360 (see FIG. 2) stored in memory 130. As part of the VMM code, the VMM looks at this queue 1360 periodically, and always after it returns from a kernel call and also before it resumes a VM. One typical action, used in this invention (described further below), is the “raise interrupt” action: If the VMM sees this action it will raise an interrupt to the VM 200 in the conventional manner.
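A minimal sketch of this mechanism follows, for illustration only. The class and method names (`ActionQueue`, `post`, `check_actions`) and the action label are hypothetical; the sketch models only the two points at which the VMM always checks the queue, namely after returning from a kernel call and before resuming the VM.

```python
from collections import deque

RAISE_INTERRUPT = "raise_interrupt"  # hypothetical label for the action


class ActionQueue:
    """Queue in shared memory on which the kernel posts requests for the VMM."""

    def __init__(self):
        self._pending = deque()

    def post(self, action, payload=None):
        # Called by the kernel, which cannot call into the VMM directly.
        self._pending.append((action, payload))

    def drain(self):
        while self._pending:
            yield self._pending.popleft()


class VMM:
    def __init__(self, queue):
        self.queue = queue
        self.raised = []  # interrupts raised to the VM, in order

    def check_actions(self):
        for action, payload in self.queue.drain():
            if action == RAISE_INTERRUPT:
                self.raised.append(payload)

    def kernel_call(self, fn, *args):
        result = fn(*args)
        self.check_actions()  # always checked after returning from the kernel
        return result

    def resume_vm(self):
        self.check_actions()  # always checked before the VM resumes
```

The communication is one-directional in each sense: the VMM invokes the kernel synchronously, whereas the kernel only leaves a request that the VMM notices the next time it polls.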
As is known, for example, from U.S. Pat. No. 6,397,242 (Devine, et al., 28 May 2002), some virtualization systems allow VM instructions to run directly (in “direct execution”) on the hardware CPU(s) when possible. When necessary, however, VM execution is switched to the technique known as “binary translation,” during which the VM is running in the VMM. In any systems where the VM is running in direct execution when it becomes necessary for the VMM to check actions, the kernel must interrupt the VMM so that it will stop executing VM instructions and check its action queue 1360. This may be done using known programming techniques.
Worlds
The kernel 600 handles not only the various VMM/VMs, but also any other applications running on the kernel, as well as the COS 420 and even the hardware CPU(s) 110, as entities that can be separately scheduled. In this disclosure, each schedulable entity is referred to as a “world,” which contains a thread of control, an address space, machine memory, and handles to the various device objects that it is accessing. Worlds, represented in FIG. 1 within the kernel 600 as module 612, are stored in a portion of the memory space controlled by the kernel. Each world also has its own task structure, and usually also a data structure for storing the hardware state currently associated with the respective world.
There will usually be different types of worlds: 1) system worlds, which are used for idle worlds, one per CPU, and a helper world that performs tasks that need to be done asynchronously; 2) a console world, which is a special world that runs in the kernel and is associated with the COS 420; and 3) virtual machine worlds. Worlds preferably run at the most-privileged level (for example, in a system with the Intel x86 architecture, this will be level CPL0), that is, with full rights to invoke any privileged CPU operations. A VMM, which, along with its VM, constitutes a separate world, therefore may use these privileged instructions to allow it to run its associated VM so that it performs just like a corresponding “real” computer, even with respect to privileged operations.
Switching Worlds
When the world that is running on a particular CPU (which may be the only one) is preempted by or yields to another world, then a world switch has to occur. A world switch involves saving the context of the current world and restoring the context of the new world such that the new world can begin executing where it left off the last time that it was running.
The first part of the world switch procedure that is carried out by the kernel is that the current world's state is saved in a data structure that is stored in the kernel's data area. Assuming the common case of an underlying Intel x86 architecture, the state that is saved will typically include: 1) the exception flags register; 2) general purpose registers; 3) segment registers; 4) the instruction pointer (EIP) register; 5) the local descriptor table register; 6) the task register; 7) debug registers; 8) control registers; 9) the interrupt descriptor table register; 10) the global descriptor table register; and 11) the floating point state. Similar state information will need to be saved in systems with other hardware architectures.
After the state of the current world is saved, the state of the new world can be restored. During the process of restoring the new world's state, no exceptions are allowed to take place because, if they did, the state of the new world would be inconsistent upon restoration of the state. The same state that was saved is therefore restored. The last step in the world switch procedure is restoring the new world's code segment and instruction pointer (EIP) registers.
When worlds are initially created, the saved state area for the world is initialized to contain the proper information such that when the system switches to that world, then enough of its state is restored to enable the world to start running. The EIP is therefore set to the address of a special world start function. Thus, when a running world switches to a new world that has never run before, the act of restoring the EIP register will cause the world to begin executing in the world start function.
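The save-and-restore sequence described above can be sketched as follows, for illustration only. The sketch collapses the saved state into a register dictionary and an instruction pointer; the names `World`, `CPU`, and `WORLD_START` are hypothetical, and masking of exceptions during the restore is not modeled.

```python
WORLD_START = "world_start"  # stands in for the address of the world start function


class World:
    def __init__(self, name):
        self.name = name
        # Saved-state area, initialized so that the first switch to this
        # world begins execution in the world start function.
        self.saved = {"regs": {}, "eip": WORLD_START}


class CPU:
    def __init__(self):
        self.regs = {}
        self.eip = None
        self.current = None

    def world_switch(self, new_world):
        # 1) Save the current world's state in its saved-state area.
        if self.current is not None:
            self.current.saved = {"regs": dict(self.regs), "eip": self.eip}
        # 2) Restore the new world's state; the instruction pointer goes last.
        self.regs = dict(new_world.saved["regs"])
        self.eip = new_world.saved["eip"]
        self.current = new_world
```

A world that has never run resumes at `WORLD_START`, whereas a world that has run before resumes at whatever instruction pointer was saved when it was last switched out.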
Switching from and to the COS world requires additional steps, which are described in U.S. patent application Ser. No. 09/877,378, which issued as U.S. Pat. No. 6,961,941 on Nov. 1, 2005, mentioned above. Understanding of the details of this process is not necessary for understanding the present invention, however, so further discussion is omitted.
Memory Management in Kernel-Based System
The kernel 600 includes a memory management module 616 that manages all machine memory that is not allocated exclusively to the COS 420. When the kernel 600 is loaded, the information about the maximum amount of memory available on the machine is available to the kernel, as well as information about how much of it is being used by the COS. Part of the machine memory is used for the kernel 600 itself and the rest is used for the virtual machine worlds.
Virtual machine worlds use machine memory for two purposes. First, memory is used to back portions of each world's memory region, that is, to store code, data, stacks, etc., in the VMM page table. For example, the code and data for the VMM 300 is backed by machine memory allocated by the kernel 600. Second, memory is used for the guest memory of the virtual machine. The memory management module may include any algorithms for dynamically allocating memory among the different VMs 200.
Interrupt and Exception Handling in Kernel-Based Systems
Interrupt and exception handling is related to the concept of “worlds” described above. As mentioned above, one aspect of switching worlds is changing various descriptor tables. One of the descriptor tables that is loaded when a new world is to be run is the new world's IDT. The kernel 600 therefore preferably also includes an interrupt/exception handler 655 that is able to intercept and handle (using a corresponding IDT in the conventional manner) interrupts and exceptions for all devices on the machine. When the VMM world is running, whichever IDT is currently loaded is replaced by the VMM's IDT, such that the VMM will handle all interrupts and exceptions.
The VMM will handle some interrupts and exceptions completely on its own. For other interrupts/exceptions, it will be either necessary or at least more efficient for the VMM to call the kernel to have the kernel either handle the interrupts/exceptions itself, or to forward them to some other sub-system such as the COS. One example of an interrupt that the VMM can handle completely on its own, with no call to the kernel, is a check-action IPI (inter-processor interrupt), which is described below. One example of when the VMM preferably calls the kernel, which then forwards an interrupt to the COS, would be where the interrupt involves devices such as a mouse, which is typically controlled by the COS. The VMM may forward still other interrupts to the VM.
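The routing choices just described can be summarized in a short sketch, given for illustration only. The source labels are hypothetical, the `"nic"` entry is merely an assumed example of an interrupt that the kernel handles itself, and treating the VM as the default destination is an assumption of the sketch rather than a stated rule.

```python
VMM_LOCAL = {"check_action_ipi"}  # handled entirely by the VMM
KERNEL_HANDLED = {"nic"}          # assumed example: kernel handles it itself
FORWARD_TO_COS = {"mouse"}        # kernel forwards to the console OS


def route(source):
    """Return the chain of components that see an interrupt from `source`."""
    if source in VMM_LOCAL:
        return ["vmm"]
    if source in KERNEL_HANDLED:
        return ["vmm", "kernel"]
    if source in FORWARD_TO_COS:
        return ["vmm", "kernel", "cos"]
    return ["vmm", "vm"]  # assumption: anything else is forwarded to the VM
```

In every case the VMM sees the interrupt first, since its IDT is the one loaded while the VMM world is running.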
Device Access in Kernel-Based System
In the preferred embodiment of the invention, the kernel 600 is responsible for providing access to all devices on the physical machine, in particular, to the NIC 172. In addition to other modules that the designer may choose to load into the kernel, the kernel will therefore typically include conventional drivers as needed to control access to devices. Accordingly, FIG. 1 shows within the kernel 600 a module 610 containing loadable kernel modules and drivers.
Conventional Networking and Packets
In conventional non-virtualized systems, data transfer between an application and various devices 400-1, 400-2, . . . , 400-m often takes place over a shared or dedicated communication channel such as the bus or network 700. It is assumed here that data transfer between the system hardware 100 and each device 400-1, 400-2, . . . , 400-m takes place in units such as “packets”; other types of devices may of course also be connected to the hardware 100, both directly and via the network.
Each device may be considered to be a separate “target” or “destination” when it comes to data transfer. A hardware device controller 175 is also typically included for each device, or for each group of devices that share the bus 700 and communicate using a common protocol. In FIG. 1, only one such device controller 175 is shown, merely for the sake of simplicity. A conventional driver is also loaded in the operating system in order to support the hardware controller 175.
Assume by way of a very common example that the devices 400-1, 400-2, . . . , 400-m are USB devices. Whenever some “source” sub-system or process, such as an application, initiates a request for transfer of a block of data D to a USB device, that is, an OUT operation, it establishes a buffer in memory 130 in which it stores the data D. The source sub-system then generates a corresponding transfer request to indicate to the controller's driver that it should begin the procedure (described below) for transferring the data set D. A buffer is also established for data that is to be input from the USB device, that is, for an IN operation. Note that, in other systems, according to other protocols, the controller driver may be responsible for establishing the buffer.
The driver then splits the source's data request into sub-blocks whose size is chosen to be consistent with bus bandwidth requirements and bus (for example, USB) protocol mechanisms. For the sake of illustration, assume that the source data set D is subdivided into three sub-sets or “sub-blocks” D1, D2, and D3. In most practical cases, the number of sub-blocks will be much greater, depending on the size of the original data set D. Each sub-block D1, D2, and D3 is used as the basis for a single “transaction,” which results in the data sub-block being transferred from the source's buffer to the USB device, or vice versa. The transfer procedure is typically the same regardless of the number of transactions.
The “raw” data sub-sets D1, D2, D3, etc., alone are generally not enough to adequately define the parameters of a desired transfer. Rather, each sub-set is usually included in or referenced by another data structure that also specifies such information as the destination, the direction of transfer (IN or OUT), the size of the data sub-set to be transferred, etc. In the USB context, the data structures used for this purpose are known as “transfer descriptors” (TDs). Similar descriptors are usually also created for data transfer using other protocols. Continuing with the example of transfer according to the USB protocol, the driver then builds a list of pending transactions that are targeted for one or more USB devices attached to the bus 700. Each TD defines one transaction. The TDs are also stored in memory, in particular, a TD buffer established for the purpose.
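The subdivision of the data set D into sub-blocks and the construction of one descriptor per transaction can be sketched as follows, for illustration only. The field names and the `build_tds` function are hypothetical; actual transfer descriptors have controller-specific layouts that this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class TransferDescriptor:
    destination: int  # target device address
    direction: str    # "IN" or "OUT"
    offset: int       # where the sub-block starts in the source's buffer
    length: int       # size of the sub-block in bytes


def build_tds(data_len, max_block, destination, direction):
    """Split a buffer of data_len bytes into sub-blocks, one TD per transaction."""
    if max_block <= 0:
        raise ValueError("max_block must be positive")
    return [
        TransferDescriptor(destination, direction, off, min(max_block, data_len - off))
        for off in range(0, data_len, max_block)
    ]
```

For example, a 150-byte data set with a 64-byte limit per transaction yields three descriptors covering 64, 64, and 22 bytes, corresponding to the sub-blocks D1, D2, and D3 of the running example.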
In USB-based systems, at a predefined interval, the controller 175 begins to take TDs as inputs, usually (but not necessarily) one at a time, and from each TD and its respective data sub-block creates a data structure known as a “packet.” The controller then transfers the packets sequentially to the bus 700 via a hub (not shown). The concept of a “packet” has somewhat varying definitions in the literature, but is used here to refer to the data structure(s) used to transfer a single data sub-block D1, D2, or D3 to or from at least one destination (usually, a device) via the bus.
In order to guarantee data delivery, during a “handshake” packet phase, the target device returns to the sender (here: controller 175) information in the form of a packet indicating whether the transaction was successful, whether it failed, or whether the intended target device was busy. If no signal is transmitted back to the controller within a predetermined time, then the controller assumes that the packet transfer failed. In the case of a failed packet transfer, assuming any information is returned at all, the returned information normally includes at least the number of bytes that transferred successfully before the failure, and also usually a flag indicating what the error was. In the case of a busy device, the controller typically attempts to resubmit the packet, and may continue to do so until the transfer succeeds or fails.
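The controller's handling of the handshake phase can be sketched as follows, for illustration only. The `Handshake` values and the `transfer` function are hypothetical; a missing reply is modeled as `None`, and the bound on resubmissions (`max_attempts`) is an assumption added so that the sketch always terminates.

```python
from enum import Enum


class Handshake(Enum):
    OK = "ok"
    FAILED = "failed"
    BUSY = "busy"


def transfer(send, packet, max_attempts=8):
    """Send a packet, resubmitting while the target device reports busy.

    `send` returns a Handshake value, or None if no reply arrived in time.
    """
    for _ in range(max_attempts):
        reply = send(packet)
        if reply is None:
            return Handshake.FAILED  # no signal within the time limit
        if reply is not Handshake.BUSY:
            return reply             # success or explicit failure
    return Handshake.FAILED          # assumption: give up after max_attempts
```

A device that answers busy twice and then succeeds therefore causes three submissions of the same packet, while a timeout ends the attempt immediately.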
Input of data from a device, that is, an IN operation, is also carried out in the form of packets, with the same protocol. As with OUT operations, TDs are generated that define the destination, buffer address, etc. of a data sub-set, but the result of transmission of a packet derived from such a TD is that the data sub-set is input from the destination and placed in the buffer. In short, input of a packet of data is handled in essentially the same manner as packet output, with the obvious difference that the direction in which the corresponding data sub-set is transferred is the opposite. Note that information (in particular, at least one TD) is transmitted from the initiating component to the network (and on to the target device) for both IN and OUT operations.
Conventional Networking in Virtualized Systems
The description above relates to conventional computer systems, but applies also, with some extensions, to virtualized computer systems that run as “guests” on an underlying “host” hardware and software platform. According to the prior art, packet-based data transfer between a source (such as one of the applications 260) within the VM and a physical device (destination) is essentially the same as described above in the non-virtualized context, with the exception that the transfer is “duplicated”: The source data block D is first transferred (usually, copied) from the transfer-requesting source process into a buffer, which is normally established by the source process itself (the normal case) but could alternatively be established by a driver installed in the guest OS 220. This “guest” driver, which is analogous to (and in many cases an identical copy of) the driver in the actual, “host” OS, then builds a list of TDs from the buffered data and stores the TDs in the VM's memory space.
A virtual device controller (a software analog of the controller 175) then constructs packets from the TDs and corresponding data sub-blocks, and passes them sequentially to what it “believes” is a bus. In fact, however, the VM-issued packets are received (in particular, intercepted) by an emulated bus within the VMM. The VMM in turn passes each VM-issued packet to the system software and hardware, which places the (or a corresponding) packet on the “real” bus 700. Note that the device to which (or from which) the packets are to be sent (or received) is typically one of the physical devices 400-1, 400-2, . . . , 400-m, although these may also be emulations. As can be understood from the discussion above, with respect to packet-based transfer, the VM is designed and intended to act just like a conventional non-virtualized system, the major structural difference being that the various hardware components involved, including the controller and the bus, are implemented in software. Again, with respect to packet transfer, the VM/VMM interface is essentially a software “copy” of the hardware 100/bus 700 interface.
Shortcomings of the Prior Art
A well-known goal of all networking is increased transfer speed. Unfortunately, the known method for VM networking described above has several structural and procedural features, each of which introduces delay. Delay is caused, for example, by each of the following:
1) transitions within the host systems for both transmitting and receiving;
2) transitions between the VM and the VMM;
3) transitions between the VMM and the kernel; and
4) the need to copy data.
What is needed is a way to provide faster network I/O to and from a VM by eliminating some, and preferably all, of the causes of delay listed above. This invention provides a system configuration and method of operation that accomplishes this goal.