1. Field of the Invention
This invention relates to task scheduling mechanisms in system-level computer software, especially in the context of virtualized computer systems.
2. Background Art
The advantages of virtual machine (VM) technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines as “guests” on a single “host” platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete,” isolated computer. Depending on how it is implemented, virtualization also provides greater security since it can isolate potentially unstable or unsafe software so that it cannot adversely affect the hardware state or system files. This and other advantages are also provided by virtualization even in systems with only a single virtual machine. Computer virtualization is described in greater detail below.
A disadvantage of running multiple VMs on a single platform is that the problems faced by single machines, virtual or physical, are also multiplied, especially since the code defining all processes must eventually be executed on the same physical processor(s). One such problem is that each VM includes at least one, and possibly several, virtualized processors, each of which may spend significant time idling.
Modern operating systems generally place idle processors into a tight loop that continuously checks for the presence of new tasks by examining a runnable queue, which contains a list of tasks or processes that can be dispatched by the idle processors. Idle processors may potentially spend prolonged time periods “spinning” in such idle loops when the system load is light. This is common for operating systems executing directly on the underlying hardware as well as for the guest operating systems executed inside a VM.
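The idle loop described above can be sketched in C as follows. This is a minimal illustration only, not the idle loop of any particular operating system: the names run_queue, runq_push, runq_pop and idle_loop, the fixed queue capacity, and the max_polls bound (added so that the sketch terminates) are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define RUNQ_CAP 8  /* hypothetical fixed capacity, for illustration only */

/* Hypothetical runnable queue: a small ring buffer of task identifiers. */
struct run_queue {
    int tasks[RUNQ_CAP];
    size_t head;   /* index of the next task to dispatch */
    size_t count;  /* number of tasks currently queued */
};

/* Add a task to the queue; returns false if the queue is full. */
static bool runq_push(struct run_queue *q, int task_id)
{
    if (q->count == RUNQ_CAP)
        return false;
    q->tasks[(q->head + q->count) % RUNQ_CAP] = task_id;
    q->count++;
    return true;
}

/* Remove the oldest task; returns false if the queue is empty. */
static bool runq_pop(struct run_queue *q, int *task_id)
{
    if (q->count == 0)
        return false;
    *task_id = q->tasks[q->head];
    q->head = (q->head + 1) % RUNQ_CAP;
    q->count--;
    return true;
}

/*
 * Idle loop: poll the runnable queue until a task appears. A real idle
 * loop has no poll limit; max_polls exists only so the sketch returns.
 * Returns the number of polls that found the queue empty, i.e. the
 * processor time spent "spinning". *task_id is -1 if no task was found.
 */
static unsigned long idle_loop(struct run_queue *q, unsigned long max_polls,
                               int *task_id)
{
    unsigned long empty_polls = 0;

    while (empty_polls < max_polls) {
        if (runq_pop(q, task_id))
            return empty_polls;   /* task found: leave the idle loop */
        empty_polls++;            /* nothing runnable: keep spinning */
    }
    *task_id = -1;
    return empty_polls;
}
```

Under light load the queue is usually empty, so nearly every poll in this loop is wasted work. That is the spinning the text refers to.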
As mentioned above, in a virtualized computer system, there may be many VMs executing simultaneously on the same hardware platform. Each such VM may contain a guest operating system that spends a significant portion of its execution time in an idle loop. This scenario is particularly common for contexts where virtualization is used to consolidate multiple lightly loaded physical servers into a single server running multiple VMs: The consolidation is performed precisely because the system load for each individual server is not sufficient to warrant a separate physical machine. In such environments it is imperative that the virtualization infrastructure be capable of making intelligent scheduling decisions across VMs—VMs that have runnable tasks to perform must be preferentially scheduled on physical hardware relative to the VMs spinning in idle loops. Ideally, a VM in an idle loop should consume as little of the physical resources as possible and should be scheduled only when it is ready to exit the idle loop and perform useful work.
Multiprocessor VMs make the potential spinning problem worse. A single idle VM may have multiple virtual CPUs spinning in respective idle loops and consuming resources of multiple physical processors. Indeed, a single idle VM with sufficiently many virtual processors may potentially starve all other VMs even on a large multiprocessor system.
While intelligent scheduling of idle VMs is necessary for maximizing the overall throughput of virtualization systems, it is hard to accomplish in a fashion transparent to the guests. In particular, it is hard to determine which VMs are executing in their respective idle loops. VMs may, for example, be running different guest operating systems (Windows, Linux, Solaris, etc.) with different service packs or patches installed.
One way to identify idle guests is to export special application program interfaces (APIs) to the VMs' guest operating systems, so that each guest can signal the virtualization environment when it enters or leaves its idle loop. However, this would violate the goal of transparency: the guest operating systems would need to be modified in order to perform well inside such a virtualization environment. It is desirable to achieve the performance goal even where the guest operating system is an unmodified, stock operating system.
Intel Corp. has recognized the impact of spinning on system performance and has introduced certain hardware mechanisms in order to reduce this impact in Intel Xeon and Pentium 4 processors. Intel Xeon and Pentium 4 chips currently account for the bulk of IA-32 compatible units shipped annually.
Intel recommends the use of a PAUSE instruction in all spin-wait loops that run on Intel Xeon and Pentium 4 processors. The spin-wait loops include operating system idle loops. Because the PAUSE instruction is treated as a “no-operation” (NOP) instruction in earlier IA-32 processor generations and does not require CPUID checks, it was quickly adopted by many operating systems (Windows 2000 family, Linux, FreeBSD, etc.). On physical hardware, the PAUSE instruction placed in a tight polling loop provides the following benefits: 1) it provides a hint to the processor that the executed code sequence is a spin-wait loop in order to avoid a memory order violation and to prevent the pipeline flush; 2) it frees up execution resources that may be used by other logical threads if the processor supports hyper-threading; and 3) it reduces the power consumption by the processor.
The disadvantage of using spin loops in the context of multiple VMs, even in the presence of the PAUSE instruction, is that an idle VM will continue to consume processor resources while starving other VMs: An idle VM will continue to spin (with reduced power consumption, etc.) until the VM's scheduling quantum expires, at which point the VM is descheduled and another VM is scheduled in its place. Fully idle VMs will spend their entire scheduled quanta spinning in the idle loop, preventing other VMs from executing runnable tasks. The use of the PAUSE instruction in itself does not solve the problem of scheduling idle VMs in multi-VM environments.
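The cost of idle VMs consuming full quanta can be illustrated with a small simulation. The sketch below assumes a simple round-robin scheduler with equal quanta and VMs that are either fully idle or fully busy. These assumptions, and the names vm, sched_result and run_round_robin, are made for illustration only and do not describe any particular virtualization product.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical VM descriptor: fully idle or fully busy. */
struct vm {
    bool idle;
};

struct sched_result {
    unsigned long useful_quanta;  /* quanta given to VMs with runnable tasks */
    unsigned long wasted_quanta;  /* quanta spent spinning in an idle loop */
};

/*
 * Assumed policy: plain round-robin. Each VM in turn receives one full
 * quantum, and an idle VM spins for the whole quantum before it is
 * descheduled.
 */
static struct sched_result run_round_robin(const struct vm *vms, size_t n_vms,
                                           unsigned long total_quanta)
{
    struct sched_result r = {0, 0};

    if (n_vms == 0)
        return r;

    for (unsigned long q = 0; q < total_quanta; q++) {
        const struct vm *cur = &vms[q % n_vms];
        if (cur->idle)
            r.wasted_quanta++;
        else
            r.useful_quanta++;
    }
    return r;
}
```

With three idle VMs and one busy VM, for example, this model gives the busy VM only one quantum in four. The remaining three quarters of the processor time are spent spinning.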
Intel also recommends explicitly halting a processor by means of the HLT instruction if it remains in a spin-wait loop for a long time. Excessive transitions into and out of the halt state could, however, incur performance penalties, and operating systems are advised to evaluate performance trade-offs for their specific contexts before halting. In many instances, the idle loop may eventually halt the processor via HLT, but only after spending a substantial time in the spin-wait idle loop based on the PAUSE instruction.
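The spin-then-halt behavior described above can be modeled as follows. Because HLT is a privileged instruction, this sketch does not execute PAUSE or HLT; it only counts how an idle period would be divided between the two phases. The tick-based timing model, the spin_limit threshold, and the names idle_trace and idle_spin_then_halt are assumptions made for illustration.

```c
/* How an idle period was spent, in abstract ticks. */
struct idle_trace {
    unsigned long pause_iterations;  /* ticks spent in the PAUSE-based spin */
    unsigned long halted_ticks;      /* ticks spent halted after HLT */
};

/*
 * Model an idle loop that spins for up to spin_limit iterations and then
 * halts until work arrives. work_arrival_tick is the (assumed known) tick
 * at which a task becomes runnable and the processor leaves the idle loop.
 */
static struct idle_trace idle_spin_then_halt(unsigned long work_arrival_tick,
                                             unsigned long spin_limit)
{
    struct idle_trace t = {0, 0};

    for (unsigned long tick = 0; tick < work_arrival_tick; tick++) {
        if (t.pause_iterations < spin_limit)
            t.pause_iterations++;  /* stands in for one PAUSE plus a poll */
        else
            t.halted_ticks++;      /* stands in for time halted via HLT */
    }
    return t;
}
```

A large spin_limit avoids frequent transitions into and out of the halt state, but it also lengthens the time the processor spends spinning before it halts.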
Still another Intel recommendation is that spin-wait loops be based on the following example, which implements a “test, test-and-set” algorithm (expressed here using standard Intel instruction abbreviations):
Spin_Lock:
    CMP lockvar, 0        ; Check if lock is free
    JE Get_Lock
    PAUSE                 ; Short delay
    JMP Spin_Lock
Get_Lock:
    MOV EAX, 1
    XCHG EAX, lockvar     ; Try to get lock
    CMP EAX, 0            ; Test if successful
    JNE Spin_Lock
Critical_Section:
    <critical section code>
    MOV lockvar, 0
    . . .
Continue:
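For readers more familiar with C, the same “test, test-and-set” pattern can be sketched with C11 atomics. This is an illustrative rendering under stated assumptions, not Intel's code: the names ttas_lock, ttas_init, ttas_try_acquire, ttas_acquire and ttas_release and the cpu_relax macro are hypothetical, and the macro assumes a GCC-compatible compiler when targeting x86 (it expands to nothing elsewhere).

```c
#include <stdatomic.h>

/* Assumption: GCC/Clang builtin that emits PAUSE on x86; no-op elsewhere. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} ttas_lock;

static void ttas_init(ttas_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* Single attempt: returns 1 if the lock was acquired, 0 otherwise. */
static int ttas_try_acquire(ttas_lock *l)
{
    /* Test: a plain load, so a held lock is not hit with atomic writes. */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return 0;
    /* Test-and-set: atomic exchange, the counterpart of XCHG above. */
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Spin until the lock is acquired, pausing between failed attempts. */
static void ttas_acquire(ttas_lock *l)
{
    while (!ttas_try_acquire(l))
        cpu_relax();  /* short delay, as in the PAUSE step above */
}

/* Release the lock, the counterpart of MOV lockvar, 0. */
static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

As in the assembly listing, a waiter spins on an ordinary read and attempts the atomic exchange only when the lock appears free.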
The disadvantage of using such a spin-wait loop in multi-VM environments is the same as when using any other spin-loop based solutions: An idle VM will continue spinning and using processor cycles that could be used by other VMs with runnable tasks.
What is needed is therefore a way to reduce the waste of physical processor resources associated with existing mechanisms for scheduling multiple idling processes, and one that is suited to allocating processor resources more efficiently in virtualized multi-processor systems. This invention provides a way to do this.