1. Field of the Invention
This invention relates very broadly to computer operating systems, and in particular to a method for allocating at least one system resource that is shared among two or more guest computer systems, which are connected to a host computer system that contains the shared resource. The invention is particularly applicable where the guest systems are virtual machines communicating with an underlying physical computer architecture via a virtual machine monitor and/or similar virtual or real operating system.
2. Description of the Related Art
Every modern computer system includes some I/O or peripheral device, as well as applications. A keyboard, a mouse, a graphics or sound card, a printer, and a disk drive are all examples of such devices, and there are at least thousands of different kinds of applications. Each such device or application requires some system resource, such as processor (CPU) time, I/O bandwidth on the various data buses, and physical memory space.
As is well known, every general-purpose computer also includes an operating system (OS) not only as a common, standardized software platform on which other programs (applications) can run, but also to manage the various system resources, to ensure that different applications do not interfere with each other, to provide security, and to keep track of files and directories used to organize the memory resource, just to name a few of the tasks handled by the OS. Managing and allocating the resources of the volatile and non-volatile storage, such as the system RAM and the hard disk(s), is central to any operating system. Accordingly, software modules to provide such scheduling and related resource management are usually included in the “kernel” of the OS.
Except in very specially designed systems, each device such as a printer, a disk drive, a mouse, a sound card, etc., must be defined within the OS by a software module known as a “device driver,” or, simply a “driver.” The design of drivers for different types of devices and applications is well understood in the field of computer science. Two books that described driver design in detail for the Windows NT and Linux operating systems, respectively, are “Windows NT Device Driver Development” by Peter G. Viscarola and W. Anthony Mason, McMillan Technical Publishing, 1999; and “Linux Device Drivers” by Alessandro Rubini, O'Reilly and Associates, Inc., 1998.
As is well known, a driver typically handles and/or forwards interrupts generated by the respective device, and it also operates as a translator between the device and programs that use the device: Each device typically has its own set of specialized commands that the OS can interpret and carry out, whereas most programs access the device using generic commands. Thanks to this arrangement, it is possible to load many different programs (for example, different word processing programs) that share the same printer, although the printer driver itself will be specific to a particular operating system (such as Unix) or class of operating systems (such the Windows 95, 98, NT and 2000 series).
Many drivers for standardized devices such as a keyboard typically come with the OS. For other devices, one must load a new driver when the device is connected to the computer. For example, printers usually come with a floppy disk, CD ROM (or Internet link for downloading) that, when loaded into the proper drive, automatically installs the needed driver into the OS.
As will become clear from the discussion of the invention below, one particularly important feature of an OS is that it allocates memory resources to the various devices and running applications. For example, a graphics-intensive application will typically require much more available memory than the keyboard. The amount of memory needed by each device is passed by the driver to the OS as a parameter. This requirement for memory (including disk space) may be constant, such as for a keyboard, or it may change during the operation of the device. For example, if a device needs more memory, its driver will issue a corresponding request to the OS, which then handles the request. If enough system memory is available, then the OS will allocate the requested space. In cases where not enough system memory available, the OS will then either swap other memory pages out to the much slower disk and return the newly-freed memory to satisfy the request, or it will return an error if swapping out is insufficient or impossible. In the common multitasking environment, several different applications may be functionally connected to the OS, each having different and often conflicting needs for various system resources. As a simple example, two different programs might issue requests for data transfer from a hard disk very closely together in time. Similarly, two different programs may need to transfer so much data via a modem or other I/O device that their requests exceed the bandwidth of the device. The OS must then schedule the two competing requests for use of the resource (memory and I/O bandwidth, respectively) according to some predetermined administrative policy, such as not exceeding a maximum time of interruption of execution of either program.
The problem of competition for a system resource such as memory may become particularly acute in systems where many “guest” computer systems (physical or virtual) are connected to a central system that contains and manages the actual, physical devices (including memory) and thus acts as a “host” system. As the number of guest systems increases, so too does the probability and frequency of conflicting demands for the common resource(s) of the host.
The concept of a “host” system includes both host hardware and host software. These two aspects of the host system are explained in greater detail below, but in order to avoid possible confusion of terminology, is summarized as follows: The host hardware, as the name implies, will include one or more processors and the various physical devices. The host software will typically include some form of host operating system, as well as other conventional system software as needed.
For example, in the context of virtual machine technology, several virtual machines (VM's) often run via respective virtual machine monitors (VMM's) on a common underlying hardware platform and must share physical resources. Each VM thus constitutes a guest system, and typically includes its own guest operating system. One or more applications then usually run in each guest. Among other functions, the VMM intercepts and converts requests for physical resources by the VM's into corresponding requests for actual hardware resources. In some systems, the VMM itself performs the functions of an operating system and is responsible for managing all the physical system resources such as CPU time, physical memory, and I/O bandwidth. In other systems, the VMM forwards the requests to the underlying host operating system.
In either case, from the perspective of a VM, the VMM, or some other equivalent software module, is part of the host system, although it is included specifically to monitor and manage a VM. Because the VMM is completely or at least substantially transparent to the VM(s) it manages, but normally not to the host hardware and/or host OS, it is therefore referred to here as being part of the host software, along with or instead of the host OS, even if the VMM is “packaged” with its corresponding VM.
Regardless of the nature of the guest systems, machine resources must be shared in a controlled manner that prevents any guest system (such as a VM) from monopolizing resources to the exclusion of other guest systems. More generally, the system should guarantee that each guest will receive a predictable share of resources as defined by some administrative policy.
One obvious way to ensure this is simply to allocate to each guest system a fixed share of each resource. For example, in systems that have the well-known Intel x86 architecture, memory is divided into fixed-size units called “pages.” Similar division of memory in fixed-size pages is found in other known modern processor architectures as well, such as MIPS, PowerPC, and Alpha, although the actual size of the pages may vary. For example, the page size in the x86 architecture is 4 Kbytes, but in the Alpha architecture it is 8 Kbytes. This invention does not require any particular page size; indeed, it does not require division of the memory into pages at all.
In a static allocation scheme, the host, via the VMM's, simply allocates to each VM a predetermined, fixed number of memory pages for its use, essentially partitioning the memory. The biggest disadvantage of this static scheme is obvious: Some VM's may be allotted more memory than they need, whereas others may be forced to use the much slower hard disk much more often than they should, just because their fixed share of the fast system memory is too small. Consequently, it is usually preferable to track each VM's need for memory (or other resource) and to allocate memory dynamically.
Dynamic management of space-shared resources generally involves some form of resource revocation. Since the availability of a system resource such as memory is essentially “zero-sum,” conceptually, when one guest system is granted more space, the host (for example, a VMM, the host OS, or some other software component included in the host software) must select a “victim,” which must then relinquish some of its previously allocated space. The revocation of physical memory can be decomposed into two conceptual steps. First, a memory scheduling mechanism (the “memory scheduler”) must select a victim guest (such as a VM) from which to revoke memory. Second, the memory scheduler must choose the particular pages of memory to be revoked from the victim.
One disadvantage of existing dynamic resource allocation mechanisms is that the guest operating system running, for example, in each VM has much better information about which of its allocated memory pages (or analogous units of memory) contain the least time-critical data and are therefore more suitable to be swapped out to a slower storage device such as the hard disk: Guest operating systems often employ sophisticated memory management algorithms that exploit knowledge about internal system or application characteristics that is not available to the VMM. A guest OS may, moreover, even have excess pages available on its internal free list.
Another disadvantage of known dynamic resource allocation mechanisms is that they require relatively complicated and costly data structures and algorithms within the host software in order to track how much of the resource each guest (here: VM) has and needs and how best to allocate the resource among the competing guests. Increasing the efficiency of such systems requires more and more information about each guest to be incorporated and organized within the host. Of course, this not only increases the complexity of the allocation mechanism, but it also makes it relatively inflexible: If a new guest is connected to the host, then its local information, which might be quite different from that of other connected guests, must also be incorporated into whatever component of the host software that schedules resources. Moreover, guest-specific information needed to make decisions about specific pages to revoke differs not only across operating systems (for example, Linux versus Windows NT), but also across different versions (“service packs,” etc.) of the same vendor's OS.
Existing virtual machine operating systems also often suffer from a problem known as “redundant paging.” Redundant or “double” paging can occur due to the use of two independent levels of paging algorithms—one by the guest OS within the VM, and another in whichever host software component the guest communicates with, such as the VMM, the host operating system, or some equivalent mechanism. As a result, some pages of memory may be swapped out to disk twice, or may be read in by one level and immediately paged out by the other. Avoiding this inefficiency typically requires the implementation of a special swap file in the guest OS, which is monitored by the VMM (or equivalent mechanism in the host). The problem of double paging, and one proposed solution using a swap file, is described in “Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors”, Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosenblum, Proceedings of the 17th ACM Symposium on Operating System Principles (SOSP '99), Dec. 12-15, 1999, pp. 154-169.
Analogizing to the field of economics, one might therefore say that the prior art relies on centralized decision-making about the allocation of the limited resource (here: fast system memory), which fails to adequately take specific “local” information into account. Moreover, centralization requires a relatively large and inflexible “bureaucracy,” which, is this case, translates to various mapping tables and other data structures, as well as tracking algorithms, which leads to waste and inefficiency not only locally, but also globally.
One attempt in the prior art to decentralize the choice of which particular page to reclaim from a particular application involves a technique known as “application-level paging.” This technique is described, for example, in “Self-Paging in the Nemesis Operating System,” Steven M. Hand, Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99), February 1999, pp. 73-86. According to this technique, the OS chooses a particular application process as the “victim,” meaning that it must relinquish memory pages. The process then itself selects which of its pages to relinquish.
One disadvantage of application-level paging is that it requires an explicit interface and protocol to allow the OS and the various applications to cooperate in selecting pages to be relinquished. The page revocation mechanism in application-level paging is therefore not transparent to the OS.
Another disadvantage of application-level paging is that it presupposes that the various applications are sophisticated enough to make good decisions about which pages to relinquish. This will often not be true in many common applications. Put differently, application-level paging involves an attempt at decentralization that puts the primary responsibility for controlling page revocation at a level where the “decision-makers” are either not designed to make the decision efficiently, or are not designed to make the decision at all.
What is needed is therefore a way to quickly and efficiently allocate a system resource, in particular, memory, among several guest systems connected to a host system that is able to utilize local information about the various guest systems, that easily adapts to changes in the number of guest systems, that avoids the problem of double paging, that is essentially transparent to the guest OS, and that does so without the need for complicated, extensive centralized data structures and tracking algorithms. This invention accomplishes this.