1. Field of the Invention
This invention relates to virtualized computer systems, and, in particular, to a system and method for providing network access to a virtual computer within a physical computer.
2. Description of the Related Art
The advantages of virtual machine technology are widely recognized. Among these advantages is the ability to run multiple virtual computers (or “virtual machines”) on a single physical computer. This can make better use of the capacity of the hardware, while still ensuring that each user or application enjoys the features of a “complete,” isolated computer. A general virtual computer system is described below as background information for the invention.
General Virtualized Computer System
As is well known in the field of computer science, a virtual machine (VM) is a software abstraction or a “virtualization,” often of an actual physical computer system. FIG. 1 illustrates the general configuration of a virtual computer system 700, including one or more virtual machines (VMs), such as a first VM 200 and a second VM 200N, each of which is installed as a “guest” on a “host” hardware platform 100.
As FIG. 1 shows, the hardware platform 100 includes one or more processors (CPUs) 110, system memory 130, and a local disk 140. The system memory is typically some form of high-speed RAM (random access memory), whereas the disk (one or more) is typically a non-volatile, mass storage device. The hardware 100 may also include other conventional mechanisms such as a memory management unit (MMU) 150, various registers 160 and various input/output (I/O) devices 170.
Each VM 200, 200N typically includes at least one virtual CPU 210, at least one virtual disk 240, a virtual system memory 230, a guest operating system 220 (which may simply be a copy of a conventional operating system), and various virtual devices 270, in which case the guest operating system (“guest OS”) may include corresponding drivers 224. All of the components of the VM may be implemented in software using known techniques to emulate the corresponding components of an actual computer.
If the VM is properly designed, then it will generally not be apparent to the user that any applications 260 running within the VM are running indirectly, that is, via the guest OS and virtual processor. Applications 260 running within the VM will typically act just as they would if run on a “real” computer, except for a decrease in running speed, which may only be noticeable in exceptionally time-critical applications. Executable files will be accessed by the guest OS from a virtual disk or virtual memory, which may simply be portions of an actual physical disk or physical memory allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if they had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines is well known in the field of computer science.
Some interface is generally required between a VM and the underlying host platform (in particular, the CPU), which is responsible for actually executing VM-issued instructions and transferring data to and from the actual memory and storage devices. A common term for this interface is a “virtual machine monitor” (VMM), shown as a component 300. A VMM is usually a thin piece of software that runs directly on top of a host, or directly on the hardware, and virtualizes the resources of a physical host machine. Among other components, the VMM therefore usually includes device emulators 330, which may constitute the virtual devices 270 that the VM 200 accesses. The interface exported to the VM may be the same as the hardware interface of the underlying physical machine, so that the guest OS cannot determine the presence of the VMM.
The VMM also usually tracks and either forwards (to some form of operating system) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts. A mechanism known in the art as an exception or interrupt handler 355 may therefore be included in the VMM. As is well known, such an interrupt/exception handler normally includes an interrupt descriptor table (IDT), or some similar table, which is typically a data structure that uses information in the interrupt signal to point to an entry address for a set of instructions that are to be executed when the interrupt/exception occurs.
Although the VM (and thus the user of applications running in the VM) cannot usually detect the presence of the VMM, the VMM and the VM may be viewed as together forming a single virtual computer. They are shown in FIG. 1 as separate components for the sake of clarity.
Moreover, the various virtualized hardware components such as the virtual CPU(s) 210, the virtual memory 230, the virtual disk 240, and the virtual device(s) 270 are shown as being part of the VM 200 for the sake of conceptual simplicity—in actual implementations these “components” are usually constructs or emulations exported to the VM by the VMM. For example, the virtual disk 240 is shown as being within the VM 200. This virtual component, which could alternatively be included among the virtual devices 270, may in fact be implemented as one of the device emulators 330 in the VMM.
The device emulators 330 emulate the system resources for use within the VM. These device emulators will then typically also handle any necessary conversions between the resources as exported to the VM and the actual physical resources. One advantage of such an arrangement is that the VMM may be set up to expose “generic” devices, which facilitates VM migration and hardware platform-independence. For example, the VMM may be set up with a device emulator 330 that emulates a standard Small Computer System Interface (SCSI) disk, so that the virtual disk 240 appears to the VM 200 to be a standard SCSI disk connected to a standard SCSI adapter, whereas the underlying, actual, physical disk 140 may be something else. In this case, a standard SCSI driver is installed into the guest OS 220 as one of the drivers 224. The device emulator 330 then interfaces with the driver 224 and handles disk operations for the VM 200. The device emulator 330 then converts the disk operations from the VM 200 to corresponding disk operations for the physical disk 140.
Virtual and Physical Memory
As in most modern computers, the address space of the memory 130 is partitioned into pages (for example, in the x86 architecture) or other analogous units. Applications then address the memory 130 using virtual addresses (VAs), which include virtual page numbers (VPNs). The VAs are then mapped to physical addresses (PAs) that are used to address the physical memory 130. (VAs and PAs have a common offset from a base address, so that only the VPN needs to be converted into a corresponding physical page number (PPN).) The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as “Computer Organization and Design: The Hardware/Software Interface,” by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (chapter 7.4 “Virtual Memory”). Similar mappings are used in other architectures where relocatability is possible.
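By way of illustration only, the conversion of a virtual address into a physical address may be sketched as follows. The 4 KiB page size, the dictionary used as a page table, and the example page numbers are assumptions made for this sketch and do not appear in the description above.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1  # low bits hold the offset within a page


def translate(va, page_table):
    """Map a virtual address to a physical address via a VPN -> PPN table."""
    vpn = va >> PAGE_SHIFT           # virtual page number
    offset = va & OFFSET_MASK        # the offset is common to the VA and the PA
    ppn = page_table[vpn]            # raises KeyError if the page is unmapped
    return (ppn << PAGE_SHIFT) | offset


# Hypothetical mapping: virtual page 0x2 is backed by physical page 0x7A.
page_table = {0x2: 0x7A}
pa = translate(0x2ABC, page_table)   # only the page number changes
```

Because the offset passes through unchanged, the table needs one entry per page rather than one per address.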
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 260 in the VM 200 is remapped twice in order to determine which page of the hardware memory is intended. The first mapping is provided by a mapping module within the guest OS 220, which translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner. The guest OS therefore “believes” that it is directly addressing the actual hardware memory, but in fact it is not.
Of course, a valid address to the actual hardware memory must ultimately be generated. A memory management module 350, typically located in the VMM 300, therefore performs the second mapping by taking the GPPN issued by the guest OS 220 and mapping it to a hardware (or “machine”) page number PPN that can be used to address the hardware memory 130. This GPPN-to-PPN mapping may instead be done in the main system-level software layer (such as in a mapping module in a kernel 600, which is described below), depending on the implementation. From the perspective of the guest OS, the GVPN and GPPN might be virtual and physical page numbers just as they would be if the guest OS were the only OS in the system. From the perspective of the system software, however, the GPPN is a page number that is then mapped into the physical memory space of the hardware memory as a PPN.
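The two successive mappings described above may be sketched as follows, purely as an illustration. The function name, the use of dictionaries for the guest page table and for the GPPN-to-PPN map, and the 4 KiB page size are assumptions of this sketch.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def guest_va_to_machine_pa(gva, guest_page_table, gppn_to_ppn):
    """Translate a guest virtual address to a machine (hardware) address."""
    gvpn = gva >> PAGE_SHIFT
    offset = gva & OFFSET_MASK
    gppn = guest_page_table[gvpn]    # first mapping, maintained by the guest OS
    ppn = gppn_to_ppn[gppn]          # second mapping, maintained by the VMM or kernel
    return (ppn << PAGE_SHIFT) | offset


# Hypothetical tables: guest page 0x10 -> guest "physical" page 0x3 -> machine page 0x1F4.
guest_page_table = {0x10: 0x3}
gppn_to_ppn = {0x3: 0x1F4}
machine_pa = guest_va_to_machine_pa(0x10020, guest_page_table, gppn_to_ppn)
```

The guest OS sees and maintains only the first table. The second table is invisible to it, which is why the guest OS “believes” that a GPPN addresses the hardware memory directly.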
System Software Configurations in Virtualized Systems
In some systems, such as the Workstation product of VMware, Inc., of Palo Alto, Calif., the VMM is co-resident at system level with a host operating system. Both the VMM and the host OS can independently modify the state of the host processor, but the VMM calls into the host OS via a driver and a dedicated user-level application to have the host OS perform certain I/O operations on behalf of the VM. The virtual computer in this configuration is thus fully hosted in that it runs on an existing host hardware platform and together with an existing host OS.
In other implementations, a dedicated kernel takes the place of and performs the conventional functions of the host OS, and virtual computers run on the kernel. FIG. 1 illustrates a kernel 600 that serves as the system software for several VM/VMM pairs 200/300, . . . , 200N/300N. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple VMs (for example, for resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs. The ESX Server product of VMware, Inc., has such a configuration. The invention described below takes advantage of the ability to optimize a kernel as a platform for virtual computers.
A kernel-based virtualization system of the type illustrated in FIG. 1 is described in U.S. patent application Ser. No. 09/877,378 (“Computer Configuration for Resource Management in Systems Including a Virtual Machine”), which is incorporated here by reference. The main components of this system and aspects of their interaction are, however, outlined below.
At boot-up time, an existing operating system 420 may be at system level and the kernel 600 may not yet even be operational within the system. In such case, one of the functions of the OS 420 may be to make it possible to load the kernel 600, after which the kernel runs on the native hardware 100 and manages system resources. In effect, the kernel, once loaded, displaces the OS 420. Thus, the kernel 600 may be viewed either as displacing the OS 420 from the system level and taking this place itself, or as residing at a “sub-system level.” When interposed between the OS 420 and the hardware 100, the kernel 600 essentially turns the OS 420 into an “application,” which has access to system resources only when allowed by the kernel 600. The kernel then schedules the OS 420 as if it were any other component that needs to use system resources.
The OS 420 may also be included to allow applications unrelated to virtualization to run; for example, a system administrator may need such applications to monitor the hardware 100 or to perform other administrative routines. The OS 420 may thus be viewed as a “console” OS (COS). In such implementations, the kernel 600 preferably also includes a remote procedure call (RPC) mechanism to enable communication between, for example, the VMM 300 and any applications 430 installed to run on the COS 420.
Actions
In kernel-based systems such as the one illustrated in FIG. 1, there must be some way for the kernel 600 to communicate with the VMM 300. In general, the VMM 300 can call into the kernel 600 but the kernel cannot call directly into the VMM. The conventional technique for overcoming this is for the kernel to post “actions” (requests for the VMM to do something) on an action queue stored in memory 130. As part of the VMM code, the VMM looks at this queue periodically, and always after it returns from a kernel call and also before it resumes a VM. One typical action is the “raise interrupt” action: If the VMM sees this action it will raise an interrupt to the VM 200 in the conventional manner.
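A minimal sketch of such an action queue follows, for illustration only. The class names, the method names, and the string used to label the “raise interrupt” action are hypothetical. The sketch models only the two mandatory check points (after a kernel call returns and before a VM is resumed) and omits the periodic check.

```python
from collections import deque


class ActionQueue:
    """Queue on which the kernel posts requests for the VMM."""

    def __init__(self):
        self._pending = deque()

    def post(self, action):
        self._pending.append(action)      # kernel side: request work from the VMM

    def drain(self):
        while self._pending:
            yield self._pending.popleft()


class Vmm:
    def __init__(self, actions):
        self.actions = actions
        self.interrupts_raised = 0

    def check_actions(self):
        for action in self.actions.drain():
            if action == "raise_interrupt":
                self.interrupts_raised += 1   # stands in for raising an interrupt to the VM

    def kernel_call(self, fn):
        result = fn()
        self.check_actions()              # always check after returning from a kernel call
        return result

    def resume_vm(self):
        self.check_actions()              # and again before resuming the VM


queue = ActionQueue()
vmm = Vmm(queue)
queue.post("raise_interrupt")
vmm.resume_vm()
```

The queue gives the kernel a one-way channel to the VMM without requiring the kernel to call into VMM code.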
As is known, for example, from U.S. Pat. No. 6,397,242 (Devine, et al., 28 May 2002), some virtualization systems allow VM instructions to run directly (in “direct execution”) on the hardware CPU(s) when possible. When necessary, however, VM execution is switched to the technique known as “binary translation,” during which the VM is running in the VMM. In any system in which the VM is running in direct execution when it becomes necessary for the VMM to check actions, the kernel must interrupt the VMM so that it stops executing VM instructions and checks its action queue. This may be done using known programming techniques.
Worlds
The kernel 600 handles not only the various VMM/VMs, but also any other applications running on the kernel, as well as the COS 420 and even the hardware CPU(s) 110, as entities that can be separately scheduled. In this disclosure, each schedulable entity is referred to as a “world,” which contains a thread of control, an address space, machine memory, and handles to the various device objects that it is accessing. Worlds are stored in a portion of the memory space controlled by the kernel. More specifically, the worlds are controlled by a world manager, represented in FIG. 1 within the kernel 600 as module 612. Each world also has its own task structure, and usually also a data structure for storing the hardware state currently associated with the respective world.
There will usually be different types of worlds: 1) system worlds, which are used for idle worlds, one per CPU, and a helper world that performs tasks that need to be done asynchronously; 2) a console world, which is a special world that runs in the kernel and is associated with the COS 420; and 3) virtual machine worlds.
Worlds preferably run at the most-privileged level (for example, in a system with the x86 architecture, this will be level CPL0), that is, with full rights to invoke any privileged CPU operations. A VMM, which, along with its VM, constitutes a separate world, therefore may use these privileged instructions to allow it to run its associated VM so that it performs just like a corresponding “real” computer, even with respect to privileged operations.
Switching Worlds
When the world that is running on a particular CPU (which may be the only one) is preempted by or yields to another world, then a world switch has to occur. A world switch involves saving the context of the current world and restoring the context of the new world such that the new world can begin executing where it left off the last time that it was running.
The first part of the world switch procedure that is carried out by the kernel is that the current world's state is saved in a data structure that is stored in the kernel's data area. Assuming the common case of an underlying x86 architecture, the state that is saved will typically include: 1) the flags (EFLAGS) register; 2) general purpose registers; 3) segment registers; 4) the instruction pointer (EIP) register; 5) the local descriptor table register; 6) the task register; 7) debug registers; 8) control registers; 9) the interrupt descriptor table register; 10) the global descriptor table register; and 11) the floating point state. Similar state information will need to be saved in systems with other hardware architectures.
After the state of the current world is saved, the state of the new world can be restored. During the process of restoring the new world's state, no exceptions are allowed to take place because, if they did, the state of the new world would be inconsistent upon restoration of the state. The same state that was saved is therefore restored. The last step in the world switch procedure is restoring the new world's code segment and instruction pointer (EIP) registers.
When worlds are initially created, the saved state area for the world is initialized to contain the proper information such that when the system switches to that world, then enough of its state is restored to enable the world to start running. The EIP is therefore set to the address of a special world start function. Thus, when a running world switches to a new world that has never run before, the act of restoring the EIP register will cause the world to begin executing in the world start function.
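The save and restore sequence, including the first switch to a world that has never run, may be sketched as follows. This is an illustration under stated assumptions: the register set is reduced to four entries, the processor is modeled as a dictionary, and the constants WORLD_START and KERNEL_CS are hypothetical values.

```python
WORLD_START = 0x1000   # hypothetical address of the world start function
KERNEL_CS = 0x08       # hypothetical code segment selector


class World:
    def __init__(self, name):
        self.name = name
        # Initialized at creation so that the first switch to this world
        # begins executing in the world start function.
        self.saved_state = {"eflags": 0, "eax": 0, "cs": KERNEL_CS, "eip": WORLD_START}


def world_switch(cpu, current, new):
    """Save the running world's registers, then restore the new world's."""
    current.saved_state = dict(cpu)            # save into the kernel's data area
    for reg, value in new.saved_state.items():
        if reg not in ("cs", "eip"):
            cpu[reg] = value
    # The code segment and instruction pointer are restored last.
    cpu["cs"] = new.saved_state["cs"]
    cpu["eip"] = new.saved_state["eip"]


cpu = {"eflags": 0x202, "eax": 7, "cs": KERNEL_CS, "eip": 0x4242}
first, second = World("first"), World("second")
world_switch(cpu, first, second)   # second has never run, so it starts at WORLD_START
world_switch(cpu, second, first)   # first resumes where it left off
```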
Switching from and to the COS world requires additional steps, which are described in U.S. patent application Ser. No. 09/877,378, mentioned above. Understanding the details of this process is not necessary for understanding the present invention, however, so further discussion is omitted.
Memory Management in Kernel-Based System
The kernel 600 includes a memory management module 616 that manages all machine memory that is not allocated exclusively to the COS 420. When the kernel 600 is loaded, the information about the maximum amount of memory available on the machine is available to the kernel, as well as information about how much of it is being used by the COS. Part of the machine memory is used for the kernel 600 itself and the rest is used for the virtual machine worlds.
Virtual machine worlds use machine memory for two purposes. First, memory is used to back portions of each world's memory region, that is, to store code, data, stacks, etc. For example, the code and data for the VMM 300 is backed by machine memory allocated by the kernel 600. Second, memory is used for the guest memory of the virtual machine. The memory management module may include any algorithms for dynamically allocating memory among the different VMs 200.
Interrupt and Exception Handling in Kernel-Based Systems
Interrupt and exception handling is related to the concept of “worlds” described above. As mentioned above, one aspect of switching worlds is changing various descriptor tables. One of the descriptor tables that is loaded when a new world is to be run is the new world's IDT. The kernel 600 therefore preferably also includes an interrupt/exception handler 655 that is able to intercept and handle (using a corresponding IDT in the conventional manner) interrupts and exceptions for all devices on the machine. When the VMM world is running, whichever IDT was previously loaded is replaced by the VMM's IDT, such that the VMM will handle all interrupts and exceptions.
The VMM will handle some interrupts and exceptions completely on its own. For other interrupts/exceptions, it will be either necessary or at least more efficient for the VMM to call the kernel to have the kernel either handle the interrupts/exceptions itself, or to forward them to some other sub-system such as the COS. One example of an interrupt that the VMM can handle completely on its own, with no call to the kernel, is a check-action IPI (inter-processor interrupt). One example of when the VMM preferably calls the kernel, which then forwards an interrupt to the COS, would be where the interrupt involves devices such as a mouse, which is typically controlled by the COS. The VMM may forward still other interrupts to the VM.
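The division of responsibility described above may be sketched as a simple routing function, for illustration only. The interrupt labels, the choice of a virtual NIC interrupt as an example of an interrupt forwarded to the VM, and the default of handing unlisted interrupts to the kernel are all assumptions of this sketch.

```python
# All interrupt names below are hypothetical labels chosen for illustration.
VMM_ONLY = {"check_action_ipi"}    # handled entirely inside the VMM
COS_DEVICES = {"mouse"}            # devices controlled by the console OS
VM_FORWARDED = {"virtual_nic"}     # assumed example of a VM-bound interrupt


def route_interrupt(source):
    """Return the chain of components that see the interrupt, in order."""
    if source in VMM_ONLY:
        return ["vmm"]
    if source in COS_DEVICES:
        return ["vmm", "kernel", "cos"]
    if source in VM_FORWARDED:
        return ["vmm", "vm"]
    return ["vmm", "kernel"]       # assumed default: the kernel handles it
```

Every chain begins with the VMM because the VMM's IDT is the one loaded while the VMM world is running.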
Device Access in Kernel-Based System
In some embodiments of the invention, the kernel 600 is responsible for providing access to all devices on the physical machine. In addition to other modules that the designer may choose to load onto the system for access by the kernel, the kernel will therefore typically load conventional drivers as needed to control access to devices. Accordingly, FIG. 1 shows a module 610 containing loadable kernel modules and drivers. The kernel 600 may interface with the loadable modules and drivers in a conventional manner, using an application program interface (API) or similar interface.
Example Virtual Computer System
FIG. 2 shows one possible configuration for the generalized virtual computer system 700 of FIG. 1, which is useful for describing the invention. Thus, FIG. 2 shows a virtual computer system 700A that includes four VMs, namely a VM-1 200A, a VM-2 200B, a VM-3 200C and a VM-4 200D. Each of the VMs 200A, 200B, 200C and 200D may be based on a common x86 architecture, for example (for reference, see documents related to the Intel IA-32 architecture), and each of the VMs is loaded with a guest OS and a set of one or more applications. Thus, the VM-1 200A is loaded with a first guest OS 220A and a first set of applications 260A, the VM-2 200B is loaded with a second guest OS 220B and a second set of applications 260B, the VM-3 200C is loaded with a third guest OS 220C and a third set of applications 260C, and the VM-4 200D is loaded with a fourth guest OS 220D and a fourth set of applications 260D. The guest OSs 220A, 220B, 220C and 220D may be any combination of supported OSs, ranging from all four of the OSs being the same OS to each of the OSs being a different OS. For example, the guest OS 220A may be a Windows Server 2003 OS from Microsoft Corp., the guest OS 220B may be a Linux distribution from Red Hat, Inc., the guest OS 220C may be a Solaris OS from Sun Microsystems, Inc., and the guest OS 220D may also be the Windows Server 2003 OS from Microsoft Corp. The applications 260A, 260B, 260C and 260D may be any combination of supported applications, possibly with some applications being common to multiple VMs.
The VM-1 200A is supported by a first VMM 300A, the VM-2 200B is supported by a second VMM 300B, the VM-3 200C is supported by a third VMM 300C, and the VM-4 200D is supported by a fourth VMM 300D. Each of the VMMs 300A, 300B, 300C and 300D may be substantially the same as the VMM 300 described above in connection with FIG. 1. Thus, in particular, each of the VMMs 300A, 300B, 300C and 300D may include the interrupt handler 355, the device emulators 330 and the memory management module 350 that are illustrated in FIG. 1. All of the VMMs 300A, 300B, 300C and 300D are supported by the same kernel 600 that was illustrated in FIG. 1 and that was partially described above, including the memory management module 616, the world manager 612 and the interrupt/exception handler 655. The virtual computer system 700A of FIG. 2 also includes the set of loadable modules and drivers 610 that were illustrated in FIG. 1, along with the system hardware 100. The virtual computer system 700A of FIG. 2 may also include the console OS 420 and the applications 430 shown in FIG. 1, although these units are not shown in FIG. 2 for simplicity.
FIG. 2 also shows the virtual computer system 700A being connected to one or more computer networks 20 by a first network interface card (NIC) 180A and a second NIC 180B. The network(s) 20 may be a simple network, such as a local area network (LAN) based on any of a variety of networking technologies, or it may be an interconnection of multiple networks using one or more networking technologies, including zero or more LANs and zero or more wide area networks (WANs). The NICs 180A and 180B are appropriate for the system hardware 100 and for the network(s) 20 to which the virtual computer system 700A is connected.
The description in this patent generally assumes the use of the popular Ethernet networking technology for simplicity, although it may also be applied to other networking technologies, including other layer 2 technologies of the Open System Interconnection (OSI) model. There are numerous books available on Ethernet technology and a large variety of other networking and internetworking technologies. In this patent, the word Ethernet is generally used to refer to any of the variations of Ethernet technology, including, in particular, the standard IEEE (Institute of Electrical and Electronics Engineers, Inc.) 802.3 interfaces operating at 1 megabit per second (Mbps), 10 Mbps, 100 Mbps, 1 gigabit per second (Gbps) and 10 Gbps. Thus, if the network 20 is an Ethernet network, then the NICs 180A and 180B are Ethernet cards that are compatible with the system hardware 100. The system hardware 100 may constitute a conventional server computer based on the x86 architecture, for example. In this case, the NICs 180A and 180B may be Intel PRO/100 Ethernet NICs, Intel PRO/1000 Gigabit Ethernet NICs, or various other NICs, including possibly a combination of different types of NICs from the same or from different manufacturers.
In general terms, the virtual computer system 700A comprises virtualization software that supports the VMs 200A, 200B, 200C and 200D and enables the VMs to operate within the system hardware 100 and to utilize the resources of the system hardware. In the particular virtual computer system illustrated in FIG. 2, the virtualization software comprises the kernel 600, the loadable modules and drivers 610 and the VMMs 300A, 300B, 300C and 300D. In other virtual computer systems, the virtualization software may comprise other software modules or other combinations of software modules. Of particular relevance to this invention, the virtualization software in the virtual computer system 700A of FIG. 2 enables the VMs 200A, 200B, 200C and 200D to access the computer network(s) 20 through the NICs 180A and 180B. A similar virtual computer system that enabled VMs to access computer networks through physical NIC(s) of a physical computer system was described in U.S. patent application Ser. No. 10/665,779, entitled “Managing Network Data Transfers in a Virtual Computer System” (the '779 application), which is incorporated here by reference. The virtualization software of the virtual computer system 700A may be substantially the same as corresponding software modules of the virtual computer system described in the '779 application, except as described below.
Similar to the virtual computer system of the '779 application, FIG. 2 shows two NIC drivers 680A and 680B in the modules and drivers 610. The NIC driver 680A operates as a driver for the NIC 180A and the NIC driver 680B operates as a driver for the NIC 180B. Each of the NIC drivers 680A and 680B may be substantially the same as a conventional, basic NIC driver for the corresponding NIC 180A or 180B. The NIC drivers 680A and 680B are specific to the particular types of NICs used as the NICs 180A and 180B, respectively. If the two NICs are of the same type, then the corresponding NIC drivers may be separate instances of the same NIC driver. For example, for a Linux platform, if the NICs are both Intel PRO/100 Ethernet NICs, then the NIC drivers may be separate instances of the e100 driver from Intel. As is well known, the NIC drivers control the NICs, and provide an interface with the NICs. In other implementations of this invention, there may be a larger number of NICs 180 and corresponding NIC drivers 680, or there could be a smaller number of each.
One of the device emulators 330 (see FIG. 1) within each of the VMMs 300A, 300B, 300C and 300D emulates one or more NICs to create virtual NICs for the VMs 200A, 200B, 200C and 200D. Thus, in the virtual computer system of FIG. 2, a device emulator 330 within the VMM 300A supports a virtual NIC 280A for the VM 200A, a device emulator 330 within the VMM 300B supports a virtual NIC 280B for the VM 200B, a device emulator 330 within the VMM 300C supports a virtual NIC 280C for the VM 200C, and a device emulator 330 within the VMM 300D supports a first virtual NIC 280D for the VM 200D and a second virtual NIC 280E for the VM 200D. The device emulators 330 preferably emulate the NICs in such a way that software within the VMs 200A, 200B, 200C and 200D, as well as a user of the VMs, cannot tell that the virtual NICs 280A, 280B, 280C, 280D and 280E are not actual, physical NICs. Techniques for emulating a NIC in this manner are well known in the art. The virtual NICs 280A, 280B, 280C, 280D and 280E may all be generic NICs, they may all be specific NICs, such as Intel PRO/100 Ethernet NICs, for example, or they may be a combination of generic NICs and/or specific NICs of one or more different types.
The virtual NICs are preferably widely supported NICs, having drivers available for a large number and variety of OSs, such as the PCnet Lance Ethernet driver, from Advanced Micro Devices, Inc., which is built into all OSs that are common at this time. A NIC driver that is appropriate for each of the virtual NICs and the corresponding guest OSs is loaded as one of the drivers 224 (see FIG. 1), if it is not already resident in the corresponding guest OS. Thus, a NIC driver 281A that is appropriate for the virtual NIC 280A and the guest OS 220A is loaded as a driver 224 in the VM 200A, a NIC driver 281B that is appropriate for the virtual NIC 280B and the guest OS 220B is loaded as a driver 224 in the VM 200B, a NIC driver 281C that is appropriate for the virtual NIC 280C and the guest OS 220C is loaded as a driver 224 in the VM 200C, and a NIC driver 281D that is appropriate for the virtual NICs 280D and 280E and the guest OS 220D is loaded as a driver 224 in the VM 200D. Here, the virtual NICs 280D and 280E are assumed to be of the same type for simplicity, so that the corresponding NIC drivers may be separate instances of the same NIC driver 281D, although the virtual NICs 280D and 280E may alternatively be of different types, requiring different NIC drivers. The NIC drivers 281A, 281B, 281C and 281D may be standard NIC drivers for use with the corresponding emulated virtual NICs 280A, 280B, 280C, 280D and 280E, or they may be custom NIC drivers that are optimized for the virtual computer system 700A.
From the perspective of the guest OSs 220A, 220B, 220C and 220D, the guest applications 260A, 260B, 260C and 260D, and the users of any of this guest software, the respective VMs 200A, 200B, 200C and 200D preferably appear to be conventional physical computers, and the virtual NICs 280A, 280B, 280C, 280D and 280E preferably appear to be conventional physical NICs connected to the network(s) 20. Thus, guest software and/or users of the guest software may attempt to communicate with other computers over the network(s) 20 in a conventional manner. For example, the VM 200A may implement an email server that is accessible through the network(s) 20. A client computer attached to the network(s) 20 may communicate with the VM 200A to retrieve email messages, and the VM 200A may respond, as appropriate. These communications would involve one or more incoming network data frames that would arrive from the client computer at one or both of the NICs 180A and 180B and that must be forwarded to the VM 200A, along with one or more outgoing network data frames that would be sent out by the VM 200A and that must be transmitted to the client computer through one or both of the NICs 180A and 180B. One of the functions of the virtualization software of the virtual computer system 700A is to facilitate these communications by the VMs 200A, 200B, 200C and 200D over the network(s) 20 by conveying incoming and outgoing network data frames between the VMs and the physical NICs 180A and 180B.
As described in the '779 application, a NIC manager 642 plays an important role in enabling software entities within the virtual computer system 700A to communicate over the network(s) 20. In FIG. 2, the NIC manager 642 is shown as being implemented within the kernel 600, although the NIC manager 642 may alternatively be implemented as a driver within the modules and drivers 610. Thus, the NIC manager 642 receives outgoing network data frames from the VMs 200A, 200B, 200C and 200D and forwards them to the NIC drivers 680A and 680B for transmission onto the network by the respective NICs 180A and 180B. The NIC manager 642 also receives incoming network data frames from the NICs, through the NIC drivers, and routes them to the appropriate destinations, such as the VMs 200A, 200B, 200C and 200D, based on the layer 2 and/or layer 3 destination address(es) contained in the data frames. For example, for internet protocol (IP) data over an Ethernet network, the NIC manager 642 routes the data frames based on the medium access control (MAC) address and/or the IP address.
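Routing of incoming frames by layer 2 destination address may be sketched as follows, for illustration only. The class and method names are hypothetical, the MAC addresses are made-up example values, and the sketch handles only unicast Ethernet frames whose destination MAC address occupies the first six bytes. Broadcast, multicast and IP-based routing are omitted.

```python
def dest_mac(frame):
    """Return the destination MAC, the first six bytes of an Ethernet frame."""
    return frame[:6]


class NicManager:
    """Hypothetical stand-in for the routing role of the NIC manager 642."""

    def __init__(self):
        self._inboxes = {}                 # MAC address -> per-virtual-NIC inbox

    def attach(self, mac, inbox):
        self._inboxes[mac] = inbox

    def route_incoming(self, frame):
        inbox = self._inboxes.get(dest_mac(frame))
        if inbox is None:
            return False                   # no matching virtual NIC: drop the frame
        inbox.append(frame)
        return True


# Made-up MAC addresses for two virtual NICs and one remote client.
mac_a = bytes.fromhex("020000000001")
mac_b = bytes.fromhex("020000000002")
client = bytes.fromhex("0200000000ff")

manager = NicManager()
inbox_a, inbox_b = [], []
manager.attach(mac_a, inbox_a)
manager.attach(mac_b, inbox_b)

# Destination MAC, source MAC, EtherType 0x0800 (IPv4), then the payload.
frame = mac_a + client + b"\x08\x00" + b"payload"
delivered = manager.route_incoming(frame)
```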
When a software entity within one of the VMs 200A, 200B, 200C or 200D wants to send an outgoing data frame to the network(s) 20, the data frame is sent to the respective NIC driver 281A, 281B, 281C or 281D, so that the respective NIC driver can send the data frame onto the network(s) 20 using the respective virtual NIC 280A, 280B, 280C, 280D or 280E, with the virtual NICs appearing to be connected directly to the network(s) 20. Instead of going directly out onto the network(s) 20, however, the data frame is first forwarded to the NIC manager 642. As an example, a software entity within the VM 200A may send an outgoing data frame to the NIC driver 281A for transmission on the network(s) 20. This data frame is forwarded from the NIC driver 281A to the NIC manager 642. This forwarding of data frames can be accomplished in a variety of ways. For example, if the virtual NIC 280A emulated by the device emulator 330 (see FIG. 1) is a standard NIC that provides direct memory access (DMA) capabilities, and the NIC driver 281A (see FIG. 2) is a standard NIC driver for that particular type of NIC, then the NIC driver 281A attempts to set up the NIC 280A to perform a DMA transfer of the data frame. The device emulator 330 responds by communicating with the NIC driver 281A and performing the transfer of data, making it appear to the NIC driver 281A that the virtual NIC 280A performed the DMA transfer, as expected. The emulator 330 then provides the data frame to the NIC manager 642 for routing through one of the NIC drivers 680A and 680B and the corresponding NIC 180A or 180B onto the network. For example, the emulator 330 may copy the data frame from a memory page that is controlled by the VM 200A to a memory page that is controlled by the kernel 600, and which is accessible to the NIC manager 642, and, more particularly, to the NIC drivers 680A and 680B.
Similarly, for an incoming data frame to the VM 200A, the NIC manager 642 receives the data frame from one of the NIC drivers 680A or 680B and forwards the data frame to the device emulator 330 within the VMM 300A. The device emulator 330 places the data frame in an appropriate location in memory and generates an appropriate interrupt to the guest OS 220A to cause the NIC driver 281A to retrieve the data frame from memory. A person of skill in the art will understand how to emulate the virtual NICs 280A, 280B, 280C, 280D and 280E in this manner to facilitate the transfer of data frames between the NIC drivers 281A, 281B, 281C and 281D and the NIC manager 642. A person of skill in the art will also understand how to minimize the number of times that data frames are copied in transferring data between the NIC drivers 281A, 281B, 281C and 281D and the network(s) 20, depending on the particular implementation. For example, for an outgoing data frame, it may be possible to set up the physical NICs 180A and 180B to perform DMA transfers directly from the NIC drivers 281A, 281B, 281C and 281D, to avoid any unnecessary copying of the data.
For this description, suppose that the virtual computer system 700A is connected to an Ethernet network and that each of the physical NICs 180A and 180B and each of the virtual NICs 280A, 280B, 280C, 280D and 280E are Ethernet cards. In this case, each of the virtual NICs 280A, 280B, 280C, 280D and 280E preferably has a MAC address that is unique, at least within the virtual computer system 700A, and preferably also within the local network to which the virtual computer system 700A is connected. Then, for example, any outgoing data frames from the VM 200A will contain the MAC address of the virtual NIC 280A in the source address field of the Ethernet frame, and any incoming data frames for the VM 200A will contain the same MAC address, or a broadcast or multicast address, in the destination address field of the Ethernet frame. Each of the NICs 180A and 180B may be placed in a promiscuous mode, which causes the NICs to receive all incoming data frames and forward them to the respective NIC drivers 680A and 680B, even if they don't contain the MAC address of the respective NIC. This ensures that the NIC manager 642 receives data frames containing the MAC address of each of the virtual NICs 280A, 280B, 280C, 280D and 280E. The NIC manager 642 then routes incoming data frames to the appropriate VMs 200A, 200B, 200C and 200D, based on the MAC address that is contained in the destination field of the Ethernet frame. The NIC manager 642 is generally able to transmit data frames from the VMs 200A, 200B, 200C and 200D through the NICs 180A and 180B, using the MAC address of the respective virtual NIC 280A, 280B, 280C, 280D or 280E within the source field of the Ethernet frame. In other words, the physical NICs 180A and 180B generally transmit outgoing data frames onto the network, even if the data frames do not contain the MAC address of the physical NICs in the source address field.
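The MAC-based routing of incoming data frames described above can be illustrated with a minimal sketch. The sketch is not the implementation of the NIC manager 642; the function names, the table layout and the MAC addresses shown are hypothetical and are chosen only for illustration. The sketch also assumes that every broadcast or multicast frame is delivered to every virtual NIC, without any multicast filtering.

```python
from typing import Dict, List

ETH_HEADER_LEN = 14  # destination MAC (6) + source MAC (6) + EtherType (2)


def dest_mac(frame: bytes) -> bytes:
    """Return the 6-byte destination address at the start of an Ethernet frame."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame is shorter than an Ethernet header")
    return frame[:6]


def is_group_address(mac: bytes) -> bool:
    """True for broadcast and multicast addresses.

    The least-significant bit of the first octet is set for group addresses.
    """
    return bool(mac[0] & 0x01)


def route_incoming(frame: bytes, vnic_by_mac: Dict[bytes, str]) -> List[str]:
    """Return the labels of the virtual NICs that should receive the frame."""
    dst = dest_mac(frame)
    if is_group_address(dst):
        # Assumption: broadcast/multicast frames go to every virtual NIC.
        return sorted(vnic_by_mac.values())
    target = vnic_by_mac.get(dst)
    # A unicast frame for an unknown MAC address is not delivered to any VM.
    return [target] if target is not None else []


# Hypothetical, locally administered MAC addresses for two virtual NICs.
vnic_by_mac = {
    bytes.fromhex("020000000001"): "280A",
    bytes.fromhex("020000000002"): "280B",
}

# Source address and EtherType filler that completes a 14-byte header.
rest_of_header = bytes(6) + b"\x08\x00"
unicast = bytes.fromhex("020000000001") + rest_of_header
broadcast = b"\xff" * 6 + rest_of_header

print(route_incoming(unicast, vnic_by_mac))
print(route_incoming(broadcast, vnic_by_mac))
```

In this sketch a frame is seen by the routing function only because the physical NIC passed it up; with the physical NICs in promiscuous mode, as described above, frames addressed to any of the virtual MAC addresses reach the lookup.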
Incoming data frames may also be routed to other destinations within the virtual computer system 700A, such as to an application 430 (see FIG. 1), as appropriate. Similarly, other entities within the virtual computer system 700A may generate outgoing data frames for transmission on the attached network. For example, on behalf of an application 430, a NIC driver within the COS 420 (see FIG. 1 again), possibly in coordination with the NIC manager 642, may insert the MAC address of one of the NICs 180A or 180B into the source field of the Ethernet header of an outgoing data frame. Then, responsive incoming data frames destined for the application 430 will contain the same MAC address, or a broadcast or multicast address, in the destination field of the Ethernet frame. Using these techniques, the NIC drivers 281A, 281B, 281C and 281D within the guest OSs 220A, 220B, 220C and 220D, the NIC driver within the COS 420, the virtual NICs 280A, 280B, 280C, 280D and 280E, the device emulators 330, the NIC manager 642, the NIC drivers 680A and 680B, and the physical NICs 180A and 180B are able to transfer both incoming data frames and outgoing data frames between numerous different software entities within the virtual computer system 700A and numerous different software entities on the network.
One of the primary functions of the NIC manager 642 is to decide which outgoing data frames will be routed over each of the physical NICs 180A and 180B. As described in the '779 application, the NIC manager 642 operates in coordination with a VM manager 660 and a resource manager 662, which are additional units of the kernel 600, as illustrated in FIG. 2. The VM manager 660 and the resource manager 662 may be combined into a single software unit or they may be implemented as separate units as illustrated in FIG. 2. The VM manager 660 and the resource manager 662 are illustrated and described as separate units herein simply because they have distinct functions. The VM manager 660 performs high-level functions related to the control and operation of the VMs 200A, 200B, 200C and 200D. For example, the VM manager 660 may initialize a new VM, suspend an active VM, terminate a VM or cause a VM to migrate to another physical computer system. The VM manager 660 may perform these actions in response to a variety of stimuli or conditions, such as in response to commands from a system administrator at a control console, in response to conditions within a VM or in response to other conditions within the virtual computer system 700A.
The resource manager 662 generally allocates system resources between the multiple VMs 200A, 200B, 200C and 200D, as well as between the other worlds within the virtual computer system. For example, the resource manager 662 schedules and manages access to the CPU(s), the memory, the network resources and any accessible data storage resources. The resource manager 662 may allow a system administrator to specify various levels of service that are to be provided to each of the VMs for each of the system resources. For example, an application 430 running on the COS 420 (see FIG. 1) may provide a user interface to a system administrator, enabling the system administrator to control numerous system parameters, including the levels of service of system resources for the multiple VMs 200A, 200B, 200C and 200D. The resource manager 662 then works with other units within the computer system 700A to provide the requested levels of service.
The NIC manager 642 preferably obtains and evaluates a variety of NIC management information and VM-specific information received from the VM manager 660, the resource manager 662 and other sources, in deciding whether a data frame is to be transferred onto the network(s) 20, queued for transferring at a later time, or discarded; and, if a decision is made to transfer the data frame, the NIC manager 642 also decides over which NIC 180A or 180B the data frame is to be transferred.
As also described in the '779 application, the NIC manager 642 preferably provides NIC teaming capabilities such as failover and failback functions, along with a load distribution function, when making these decisions. A wide variety of algorithms may be implemented in making data frame routing decisions. One such algorithm involves sending outgoing data frames from each VM in the virtual computer system over a different physical NIC. For example, if there are two VMs in the system and two physical NICs, one VM's data frames would be routed over the first physical NIC and the other VM's data frames would be routed over the other physical NIC. As described in the '779 application, such an algorithm provides greater isolation between the operation of the different VMs in the system. This algorithm is not possible, however, in many virtual computer systems, because the number of VMs in such systems exceeds the number of physical NICs in the systems. For example, in the virtual computer system 700A of FIG. 2, there are four VMs 200A, 200B, 200C and 200D, but only two physical NICs 180A and 180B. In this case, there is no way to give each VM in the system its own physical NIC.
Suppose in the virtual computer system 700A of FIG. 2 that the NIC manager 642 is configured to route data frames from the VM-1 200A and the VM-2 200B through the first physical NIC 180A and to route data frames from the VM-3 200C and the VM-4 200D through the second physical NIC 180B. In this case, the VM-1 200A and the VM-2 200B are sharing the first physical NIC 180A, while the VM-3 200C and the VM-4 200D are sharing the second physical NIC 180B. The NIC manager 642 may also restrict incoming data frames in a similar way, so that, if an incoming broadcast data frame is received at the first NIC 180A but not at the second NIC 180B (for example, if the NICs are connected to different networks), then the data frame is delivered to the VM-1 200A and the VM-2 200B, but not to the VM-3 200C or the VM-4 200D. Although using this algorithm improves the isolation between some of the VMs, each of the physical NICs 180A and 180B is still shared between multiple VMs. As another alternative, suppose that a critical application requiring consistent, reliable network access is executing in the VM-1 200A. In this case, the physical NIC 180A may be dedicated for use by the VM-1 200A, while the other VMs 200B, 200C and 200D all use the other physical NIC 180B for network access. This algorithm improves the isolation of the VM-1 200A, but the remaining VMs must still share a physical NIC.
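The two static assignments discussed above can be sketched as simple lookup tables. This is only an illustrative sketch: the dictionary names and helper functions are hypothetical, and the VM and NIC labels are taken from the example in the text. The sketch shows the outgoing NIC for each VM, the VMs that receive a broadcast frame arriving at a given NIC, and which NICs remain shared by more than one VM.

```python
from typing import Dict, List

# Paired assignment from the example: two VMs per physical NIC.
PAIRED: Dict[str, str] = {
    "VM-1": "180A", "VM-2": "180A",
    "VM-3": "180B", "VM-4": "180B",
}

# Alternative from the example: NIC 180A dedicated to VM-1.
DEDICATED: Dict[str, str] = {
    "VM-1": "180A",
    "VM-2": "180B", "VM-3": "180B", "VM-4": "180B",
}


def outgoing_nic(assignment: Dict[str, str], vm: str) -> str:
    """Physical NIC over which the given VM's outgoing frames are routed."""
    return assignment[vm]


def broadcast_recipients(assignment: Dict[str, str], arrival_nic: str) -> List[str]:
    """VMs that receive a broadcast frame that arrived at arrival_nic only."""
    return sorted(vm for vm, nic in assignment.items() if nic == arrival_nic)


def shared_nics(assignment: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each NIC used by more than one VM to the VMs sharing it."""
    groups: Dict[str, List[str]] = {}
    for vm, nic in assignment.items():
        groups.setdefault(nic, []).append(vm)
    return {nic: sorted(vms) for nic, vms in groups.items() if len(vms) > 1}


print(broadcast_recipients(PAIRED, "180A"))
print(shared_nics(PAIRED))
print(shared_nics(DEDICATED))
```

With four VMs and two physical NICs, any such table leaves at least one NIC shared, which is the limitation the preceding paragraphs describe.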
This sharing of physical NICs causes a possible security risk, with the degree of risk varying, depending on the implementation and use of the virtual computer system. For example, suppose that a first VM in a virtual computer system is afflicted with a virus or some other malicious software while a second VM is running an important application. Or, even worse, suppose that the first VM is being actively used by a sophisticated hacker while an important application is running in the second VM. Any ability within the first VM to tap into the network traffic of the second VM increases the risk of compromising the second VM.
The same risk of sharing NICs also exists, likely to an even greater extent, if the NIC manager 642 implements other algorithms for routing data frames between the VMs 200A, 200B, 200C and 200D and the physical NICs 180A and 180B. For example, if the NIC manager 642 implements a “round robin” algorithm, so that outgoing data frames from all of the VMs are sent alternately over the first NIC 180A and the second NIC 180B, each of the VMs 200A, 200B, 200C and 200D uses both of the NICs 180A and 180B. The NIC manager 642 must then generally forward incoming broadcast data frames received at either of the NICs 180A or 180B to all of the VMs 200A, 200B, 200C and 200D. In this case, all of the VMs in the system will have some access to the network traffic of each of the other VMs.
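The round robin case can be illustrated with a short sketch. The class name and the traffic pattern below are hypothetical and are not drawn from the '779 application; the sketch simply alternates every outgoing frame between the two physical NICs and records which NICs each VM has used. Whether a particular VM touches both NICs within a given window depends on how its frames interleave with those of the other VMs; the pattern shown has each VM send two consecutive frames.

```python
from collections import defaultdict
from itertools import cycle
from typing import Dict, Iterable, Set


class RoundRobinNicSelector:
    """Alternate outgoing frames over the physical NICs, regardless of sender."""

    def __init__(self, nics: Iterable[str]) -> None:
        self._next_nic = cycle(list(nics))
        # For each VM, the set of physical NICs that have carried its frames.
        self.used_by: Dict[str, Set[str]] = defaultdict(set)

    def send(self, vm: str) -> str:
        """Pick the NIC for one outgoing frame from vm and record the use."""
        nic = next(self._next_nic)
        self.used_by[vm].add(nic)
        return nic


selector = RoundRobinNicSelector(["180A", "180B"])

# Hypothetical traffic: each VM sends two frames back to back.
traffic = ["VM-1", "VM-1", "VM-2", "VM-2", "VM-3", "VM-3", "VM-4", "VM-4"]
chosen = [selector.send(vm) for vm in traffic]

for vm in sorted(selector.used_by):
    print(vm, sorted(selector.used_by[vm]))
```

Because every VM ends up associated with both NICs in this sketch, a broadcast frame arriving at either NIC cannot be restricted to a subset of the VMs, which is the reason given above for forwarding such frames to all of them.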
What is needed, therefore, is a technique for improving the isolation between the network traffic of multiple VMs in a virtual computer system, where the number of VMs in the system exceeds the number of physical NICs in the system.