1. Field of the Invention
This invention relates generally to a virtualized computer system and, in particular, to a method and system for using swap space for host physical memory with separate swap files corresponding to different virtual machines.
2. Description of the Related Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete” computer. Depending on how it is implemented, virtualization can also provide greater security, since the virtualization can isolate potentially unstable or unsafe software so that it cannot adversely affect the hardware state or system files required for running the physical (as opposed to virtual) hardware.
As is well known in the field of computer science, a virtual machine (VM) is an abstraction—a “virtualization”—of an actual physical computer system. FIG. 1 shows one possible arrangement of a computer system 700 that implements virtualization. A virtual machine (VM) or “guest” 200 is installed on a “host platform,” or simply “host,” which will include system hardware, that is, a hardware platform 100, and one or more layers or co-resident components comprising system-level software, such as an operating system or similar kernel, or a virtual machine monitor or hypervisor (see below), or some combination of these. The system hardware typically includes one or more processors 110, memory 130, some form of mass storage 140, and various other devices 170.
Each VM 200 will typically have both virtual system hardware 201 and guest system software 202. The virtual system hardware typically includes at least one virtual CPU, virtual memory 230, at least one virtual disk 240, and one or more virtual devices 270. Note that a disk—virtual or physical—is also a “device,” but is usually considered separately because of the important role of the disk. All of the virtual hardware components of the VM may be implemented in software using known techniques to emulate the corresponding physical components. The guest system software includes a guest operating system (OS) 220 and drivers 224 as needed for the various virtual devices 270.
Note that a single VM may be configured with more than one virtualized processor. To permit computer systems to scale to larger numbers of concurrent threads, systems with multiple CPUs have been developed. These symmetric multi-processor (SMP) systems are available as extensions of the PC platform and from other vendors. Essentially, an SMP system is a hardware platform that connects multiple processors to a shared main memory and shared I/O devices. Virtual machines may also be configured as SMP VMs. FIG. 1, for example, illustrates multiple virtual processors 210-0, 210-1, . . . , 210-m (VCPU0, VCPU1, . . . , VCPUm) within the VM 200.
Yet another configuration is found in a so-called “multi-core” architecture, in which more than one physical CPU is fabricated on a single chip, each with its own set of functional units (such as a floating-point unit and an arithmetic/logic unit, or ALU) and each able to execute threads independently; multi-core processors typically share only very limited resources, such as some cache. Still another technique that provides for simultaneous execution of multiple threads is referred to as “simultaneous multi-threading,” in which more than one logical CPU (hardware thread) operates simultaneously on a single chip, but in which the logical CPUs flexibly share resources such as caches, buffers, and functional units. This invention may be used regardless of the type (physical and/or logical) or number of processors included in a VM.
If the VM 200 is properly designed, applications 260 running on the VM will function as they would if run on a “real” computer, even though the applications are running at least partially indirectly, that is via the guest OS 220 and virtual processor(s). Executable files will be accessed by the guest OS from the virtual disk 240 or virtual memory 230, which will be portions of the actual physical disk 140 or memory 130 allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if the files had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines are well known in the field of computer science.
Some interface is generally required between the guest software within a VM and the various hardware components and devices in the underlying hardware platform. This interface—which may be referred to generally as “virtualization software”—may include one or more software components and/or layers, possibly including one or more of the software components known in the field of virtual machine technology as “virtual machine monitors” (VMMs), “hypervisors,” or virtualization “kernels.” Because virtualization terminology has evolved over time and has not yet become fully standardized, these terms do not always provide clear distinctions between the software layers and components to which they refer. For example, “hypervisor” is often used to describe both a VMM and a kernel together, either as separate but cooperating components or with one or more VMMs incorporated wholly or partially into the kernel itself; however, “hypervisor” is sometimes used instead to mean some variant of a VMM alone, which interfaces with some other software layer(s) or component(s) to support the virtualization. Moreover, in some systems, some virtualization code is included in at least one “superior” VM to facilitate the operations of other VMs. Furthermore, specific software support for VMs may be included in the host OS itself. Unless otherwise indicated, the invention described below may be used in virtualized computer systems having any type or configuration of virtualization software.
Moreover, FIG. 1 shows virtual machine monitors that appear as separate entities from other components of the virtualization software. Furthermore, some software components used to implement one illustrated embodiment of the invention are shown and described as being located logically between all virtual machines and the underlying hardware platform and/or system-level host software. These components will usually be part of the overall virtualization software, although it would be possible to implement at least some part of them in specialized hardware. The illustrated embodiments are given only for the sake of simplicity and clarity and by way of illustration; as mentioned above, the distinctions are not always so clear-cut. Again, unless otherwise indicated or apparent from the description, it is to be assumed that the invention can be implemented anywhere within the overall structure of the virtualization software, and even in systems that provide specific hardware support for virtualization.
The various virtualized hardware components in the VM, such as the virtual CPU(s) 210-0, 210-1, . . . , 210-m, the virtual memory 230, the virtual disk 240, and the virtual device(s) 270, are shown as being part of the VM 200 for the sake of conceptual simplicity. In actuality, these “components” are usually implemented as software emulations 330 included in the VMM. One advantage of such an arrangement is that the VMM may (but need not) be set up to expose “generic” devices, which facilitate VM migration and hardware platform-independence.
Different systems may implement virtualization to different degrees—“virtualization” generally relates to a spectrum of definitions rather than to a bright line, and often reflects a design choice with respect to a trade-off between speed and efficiency on the one hand and isolation and universality on the other hand. For example, “full virtualization” is sometimes used to denote a system in which no software components of any form are included in the guest other than those that would be found in a non-virtualized computer; thus, the guest OS could be an off-the-shelf, commercially available OS with no components included specifically to support use in a virtualized environment.
In contrast, another concept, which has yet to achieve a universally accepted definition, is that of “para-virtualization.” As the name implies, a “para-virtualized” system is not “fully” virtualized, but rather the guest is configured in some way to provide certain features that facilitate virtualization. For example, the guest in some para-virtualized systems is designed to avoid hard-to-virtualize operations and configurations, such as by avoiding certain privileged instructions, certain memory address ranges, etc. As another example, many para-virtualized systems include an interface within the guest that enables explicit calls to other components of the virtualization software.
For some, para-virtualization implies that the guest OS (in particular, its kernel) is specifically designed to support such an interface. According to this view, having, for example, an off-the-shelf version of Microsoft Windows XP™ as the guest OS would not be consistent with the notion of para-virtualization. Others define para-virtualization more broadly to include any guest OS with any code that is specifically intended to provide information directly to any other component of the virtualization software. According to this view, loading a module such as a driver designed to communicate with other virtualization components renders the system para-virtualized, even if the guest OS as such is an off-the-shelf, commercially available OS not specifically designed to support a virtualized computer system. Unless otherwise indicated or apparent, this invention is not restricted to use in systems with any particular “degree” of virtualization and is not to be limited to any particular notion of full or partial (“para-”) virtualization.
In addition to the sometimes fuzzy distinction between full and partial (para-) virtualization, two arrangements of intermediate system-level software layer(s) are in general use—a “hosted” configuration and a non-hosted configuration (which is shown in FIG. 1). In a hosted virtualized computer system, an existing, general-purpose operating system forms a “host” OS that is used to perform certain input/output (I/O) operations, alongside and sometimes at the request of the VMM. The Workstation product of VMware, Inc., of Palo Alto, Calif., is an example of a hosted, virtualized computer system, which is also explained in U.S. Pat. No. 6,496,847 (Bugnion, et al., “System and Method for Virtualizing Computer Systems,” 17 Dec. 2002).
As illustrated in FIG. 1, in many cases, it may be beneficial to deploy VMMs on top of a software layer—a kernel 600—constructed specifically to provide efficient support for the VMs. This configuration is frequently referred to as being “non-hosted.” Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services (for example, resource management) that extend across multiple virtual machines. Compared with a hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting primarily of VMs/VMMs. The kernel 600 also handles any other applications running on it that can be separately scheduled, as well as a console operating system that, in some architectures, is used to boot the system and facilitate certain user interactions with the virtualization software.
Note that the kernel 600 is not the same as the kernel that will be within the guest OS 220—as is well known, every operating system has its own kernel. Note also that the kernel 600 is part of the “host” platform of the VM/VMM as defined above even though the configuration shown in FIG. 1 is commonly termed “non-hosted;” moreover, the kernel may be both part of the host and part of the virtualization software or “hypervisor.” The difference in terminology is one of perspective and definitions that are still evolving in the art of virtualization.
In order to more efficiently utilize memory resources in a computer system, the concept of virtual memory is often used. For example, FIG. 2 illustrates virtual memory management and address mapping functions performed by the VMM 300 and other various components of a virtualized computer system. The guest OS 220 generates a guest OS page table 292. The guest OS page table 292 contains mappings from GVPNs (Guest Virtual Page Numbers) to GPPNs (Guest Physical Page Numbers). Suppose that a guest application 260 attempts to access a memory location having a first GVPN, and that the guest OS 220 has specified in the guest OS page table 292 that the first GVPN is backed by what it believes to be a “real,” physical memory page having a first GPPN. The mapping from the first GVPN to the first GPPN is used by the virtual system hardware 201, and it is loaded into a VTLB (Virtual Translation Look-Aside Buffer) 294 which operates as a cache for the frequently accessed mappings from the GVPN to the GPPN.
A virtualized computer system typically uses a second level of address indirection to convert what the guest OS treats as a “real” address in physical memory into an address that in fact is an address in the hardware (physical) memory. The memory management module 350 thus translates the first GPPN into a corresponding actual PPN (Physical Page Number), which, in some literature, is equivalently referred to as an MPN (Machine Page Number). This translation is typically carried out by a component such as a so-called BusMem/PhysMem table, which includes mappings from guest physical addresses to bus addresses and then to physical (hardware, or “machine”) addresses. The memory management module 350 creates a shadow page table 392, and inserts a translation into the shadow page table 392 mapping the first GVPN to the first PPN. In other words, the memory management module 350 creates shadow page tables 392 that function as a cache containing the mapping from the GVPN to the PPN. This mapping from the first GVPN to the first PPN is used by the system hardware 100 to access the actual hardware storage device that is backing up the GVPN, and is also loaded into the TLB (Translation Look-Aside Buffer) 194 to cache the GVPN to PPN mapping for future memory access.
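The two levels of address indirection described above can be sketched as follows. This is a minimal illustration only: the dictionary names, the function name, and the page numbers are hypothetical and do not come from any particular virtualization product. The guest OS page table 292 is modeled as a GVPN-to-GPPN map, the translation performed by the memory management module as a GPPN-to-PPN map, and the shadow page table 392 as a cache of the composed GVPN-to-PPN mapping.

```python
# Hypothetical illustration; all names and page numbers are invented.
guest_page_table = {0x10: 0x2A}   # GVPN -> GPPN, maintained by the guest OS
gppn_to_ppn = {0x2A: 0x7F3}       # GPPN -> PPN, maintained by the virtualization software
shadow_page_table = {}            # GVPN -> PPN, cache of the composed mapping


def translate(gvpn):
    """Return the PPN backing a GVPN, filling the shadow page table on a miss."""
    if gvpn in shadow_page_table:
        # Hit: the composed mapping was cached by an earlier translation.
        return shadow_page_table[gvpn]
    gppn = guest_page_table[gvpn]    # first level: guest virtual -> guest "physical"
    ppn = gppn_to_ppn[gppn]          # second level: guest "physical" -> machine
    shadow_page_table[gvpn] = ppn    # cache the result for future accesses
    return ppn
```

In this sketch the first call to translate(0x10) walks both maps and records the result, and later calls are served from the shadow table alone, which mirrors the caching role the text assigns to the shadow page table 392.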
Note that the concept of “virtual memory” is found even in non-virtualized computer systems, where “virtual page numbers” are converted into “physical page numbers.” One effect of the second level of address indirection introduced in a virtualized computer system is thus that the guest physical page numbers, which the guest OS thinks refer directly to hardware, are in fact treated by the underlying host OS (or similar system-level component) as virtual page numbers, which are again remapped into hardware memory. To avoid any confusion that might result from the terms “virtual memory” and “virtual page number,” etc., being used even in literature describing non-virtualized computer systems, and to keep terminology as consistent as possible with convention, GVPNs and GPPNs refer here to the page numbers generated within the guest, and PPNs are the page numbers for pages in hardware (machine) memory.
FIG. 3 illustrates a conventional swap space in a virtualized computer system. As described above, by using a virtual memory scheme, each VM in a virtualized computer system is given the illusion of being a dedicated physical machine, with dedicated “physical” memory. System administrators of the virtualized computer system may over-commit memory, so that the aggregate amount of virtualized “physical” memory presented to the VMs exceeds the total amount of actual physical memory 130 on the host 100. When memory is over-committed, the system typically needs to reclaim memory from some VMs, so that the amount of host physical memory 130 allocated to each VM may be less than the amount of virtualized “physical” memory presented to the VM. One memory reclamation technique is “transparent swapping” (also known as “paging”), in which the contents of some memory are moved (“swapped out”) to a “backing store” or a “swap space,” such as disk storage, instead of remaining resident in the host's physical memory 130. When the VM later accesses memory data that have been swapped out, the virtualization system “swaps in” the relevant portion of the memory, possibly causing other portions of the memory to be swapped out. Some systems also specify an amount of VM virtualized “physical” memory that is guaranteed to be backed by actual host physical memory (i.e., the reserved part of the physical memory for a VM). In order to ensure that the virtualized computer system is able to preserve the contents of all VM memory under any circumstances, swap space must be reserved for any remaining VM memory, i.e., for the difference between the VM's virtualized “physical” memory size and the size of its guaranteed reserved memory.
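The reservation rule stated above reduces to simple arithmetic, sketched here under stated assumptions. The function names, the megabyte units, and the example VM sizes are hypothetical and chosen only for illustration.

```python
def required_swap_mb(vm_memory_mb, reserved_mb):
    """Swap to reserve for one VM: its memory size minus its guaranteed reservation."""
    if not 0 <= reserved_mb <= vm_memory_mb:
        raise ValueError("reservation must be between 0 and the VM memory size")
    return vm_memory_mb - reserved_mb


def is_overcommitted(vms, host_memory_mb):
    """True when the memory presented to all VMs exceeds host physical memory."""
    return sum(size for size, _ in vms) > host_memory_mb


# Hypothetical VMs as (virtualized "physical" memory, guaranteed reservation), in MB.
vms = [(4096, 1024), (2048, 2048), (8192, 0)]
total_swap_mb = sum(required_swap_mb(size, reserved) for size, reserved in vms)
```

With these example figures the three VMs need 3072 MB, 0 MB, and 8192 MB of swap respectively, 11264 MB in total, and a host with 8192 MB of physical memory would be over-committed because 14336 MB is presented to the VMs.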
In conventional virtualization systems, swap space is allocated from a common pool of disk storage associated with the host. Referring to FIG. 3, the physical memory 130 of the host hardware 100 uses a common swap space 350 to swap the contents of the memory 130 when the memory 130 is over-committed to the VMs 200-1, 200-2, . . . , 200-N. Under the control of the kernel 600 or other similar virtualization software, the contents of the memory 130 are swapped out to the common swap space 350 to free up the memory 130 when the memory 130 is over-committed. Also, the former contents of the memory 130 are swapped back into the memory 130 from the common swap space 350 if the VMs 200-1, 200-2, . . . , 200-N attempt to access the content.
Note that the physical (hardware) memory 130 for all of the VMs 200-1, 200-2, . . . , 200-N is backed by a single, common swap space 350, although the common swap space 350 may be physically composed of different disks, partitions, or files. Therefore, the content from the memory 130 corresponding to the various VMs 200-1, 200-2, . . . , 200-N may be swapped out to the common swap space 350 and intermixed, and there is no particular part of the common swap space 350 that is dedicated to swapping content from only those portions of the memory 130 that correspond to a particular VM 200-1, 200-2, . . . , 200-N. In other words, the common swap space 350 is a “per-host common pool” and the swap spaces for all VMs on the host are grouped together into a single logical space. This presents a number of problems.
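The intermixing described above can be illustrated with a toy model of a per-host common pool. The class, its first-free-slot allocation policy, and the VM labels are assumptions made for illustration; the source does not specify how slots in the common swap space 350 are assigned.

```python
class CommonSwapSpace:
    """Toy per-host swap pool: one slot array shared by every VM on the host."""

    def __init__(self, num_slots):
        # Each slot holds (vm_id, gppn, data) or None when free.
        self.slots = [None] * num_slots

    def swap_out(self, vm_id, gppn, data):
        """Store a page in the first free slot, regardless of which VM owns it."""
        for index, slot in enumerate(self.slots):
            if slot is None:
                self.slots[index] = (vm_id, gppn, data)
                return index
        raise MemoryError("common swap space is full")

    def swap_in(self, index):
        """Return the stored page and free its slot."""
        entry = self.slots[index]
        if entry is None:
            raise KeyError("slot is empty")
        self.slots[index] = None
        return entry

    def owners(self):
        """List the owning VM of each slot, in slot order."""
        return [slot[0] if slot else None for slot in self.slots]


pool = CommonSwapSpace(num_slots=4)
pool.swap_out("VM1", 1, b"a")
pool.swap_out("VM2", 5, b"b")
pool.swap_out("VM1", 2, b"c")
```

After these three calls, pool.owners() is ["VM1", "VM2", "VM1", None]: pages from different VMs sit in adjacent slots, so no region of the pool belongs to a single VM.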
First, if a VM is live-migrated from one physical host to another physical host while the VM is powered on, then any VM memory that is currently swapped out to the common swap space 350 must be swapped back in from the source host's swap storage 350 to the physical memory 130, putting pressure on the memory 130. Extra cycles of the CPU 110 are needed to handle the swap-in requests. This leaves the host computer system with fewer overall CPU cycles and less storage bandwidth, which will negatively affect the performance of other VMs running on the host computer system. Even worse, swapping back in all of the migrating VM's memory data will increase the amount of total physical host memory used, which could result in the host computer system swapping out other VMs' memory to the common swap space 350, thus degrading their performance even further. The content that is swapped back into the memory of the source host must then be copied to the memory of the destination host, which may itself need to swap it out to the destination host's common swap space. In short, VM migration could be very disruptive to the host computer system as a whole when a common “per-host” swap space 350 is used for all the VMs running on the host.
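A rough worst-case estimate of this disruption on the source host is sketched below. The function name and the model are assumptions: it supposes that every swapped-out page of the migrating VM must be resident in host memory at the same time before being copied, which a real system might avoid by streaming pages.

```python
def migration_cost_common_swap(swapped_pages, free_host_pages):
    """Worst-case page traffic on the source host when migrating one VM.

    Assumes (hypothetically) that all of the VM's swapped-out pages are
    swapped in and held in host memory simultaneously.
    """
    swap_ins = swapped_pages
    # Pages that do not fit in free memory displace other VMs' pages.
    evictions = max(0, swapped_pages - free_host_pages)
    return {"swap_ins": swap_ins, "evictions_of_other_vms": evictions}
```

For example, under this model a VM with 1000 swapped-out pages on a host with 300 free pages would trigger 1000 swap-ins and force up to 700 pages belonging to other VMs out to the common swap space.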
Second, another disadvantage of the common swap space 350 is that the size of the per-host swap space has to be pre-calculated by the administrator of the host computer system. It has to be big enough to support all the VMs running on the host, but not so big that swap space is left unused and wasted. This is an administrative burden that is likely to lead to a sub-optimal size of the common swap space 350.
Third, another disadvantage of the common swap space 350 is that access control can only be applied to the common swap space 350 as a whole. This means that by having access to the swap space, one has access to the swapped memory of all the running VMs, which is not desirable from a security standpoint.
Fourth, using a per-host common pool for swap space also prevents administrators and users of the host computer system from controlling where in the swap space 350 the swapped memory for different VMs will be placed, and the related quality-of-service parameters. For example, an administrator of the host computer system may want to place the swap space for high-priority VMs on a highly-available high-performance disk array, and place the swap space for low-priority VMs on cheaper, slower disks, which is not possible to implement with the conventional common swap space 350 for all the VMs. Similarly, an administrator of the host computer system may want to provide additional features, such as hardware-based encryption, to the swap space for some VMs but not for other VMs, which is not possible to implement with the conventional common swap space 350 for all the VMs.
Therefore, there is a need for swap space for swapping the physical memory in a host computer system, where VMs using the swap space can be migrated to another physical host efficiently and quickly. There is also a need for swap space for swapping the physical memory in a host computer system, where the swap space for different VMs can be controlled separately. There is also a need for finer-grained control of the swap spaces at the VM level rather than at the per-host level.