1. Field of the Invention
This invention relates to virtual computers. More particularly, this invention relates to improvements in a cluster-based symmetric multiprocessor.
2. Description of the Related Art
The meanings of certain acronyms and terminology used herein are given in Table 1.
TABLE 1
API             Application programming interface
CPU             Central processing unit
DMA             Direct Memory Access - used by hardware devices, which are required to copy data to and from main system memory. DMA is used to relieve the CPU from waiting during memory accesses.
False sharing   In shared memory multiprocessors, when processors make references to different data items within the same block even though there is no actual dependence between the references.
FSB             Front-side bus
NIC             Network interface card
NUMA            Non-uniform memory access
PCI             Peripheral Component Interconnect - a standard for peripheral software and hardware interfaces.
SMP             Symmetric multiprocessor
TLB             Translation lookaside buffer
VM              Virtual machine
VMM             Virtual machine monitor
A portion of the disclosure of this patent document, which includes a CD-ROM appendix, contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The use of virtual computers (generally referred to as “virtual machines”) to enhance computing power has been known for several decades. For example, a classic system, VM, produced by IBM, enabled multiple users to concurrently use a single computer by running multiple copies of the operating system. Virtual computers have been realized on many different types of computer hardware platforms, including both single-processor and multi-processor units.
Some virtual machine monitors are able to provide concurrent support for diverse operating systems. This requires the virtual machine monitor to present a virtual machine, that is, a coherent view of the hardware, to each operating system. The above-noted VM system has evolved to the point where it is asserted that in one version, z/VM®, available from IBM, New Orchard Road, Armonk, N.Y., multiple operating systems can execute on a single server.
Despite these achievements in virtual computing, practical issues remain. The currently dominant personal computer architecture, X86/IA32, which is used in the Intel Pentium™ and other Intel microprocessors, is not conducive to virtualization techniques for two reasons: (1) the instruction set of the CPU is not natively virtualizable; and (2) the X86/IA32 architecture has an open I/O architecture, which complicates the sharing of devices among different operating systems. This has been an impediment to continued advancements in the field. In general, it is inefficient, and probably impractical, for multiple operating systems to concurrently share common X86/IA32 hardware directly. System features of the X86/IA32 CPU, e.g., paging and protection mechanisms, and segmentation, are designed to be configured and used in a coordinated effort by only one operating system.
Limitations of the X86/IA32 architecture can be appreciated by a brief explanation of one known approach to virtual computers, in which a virtual machine monitor is used to provide a uniform execution environment within a computer. A virtual machine monitor is a software layer that in this approach is interposed between hardware of a single computer and one or more guest operating systems that support different applications. In this arrangement the virtual machine monitor interacts directly with the hardware, and exposes an expected interface to the guest operating systems. This interface includes normal hardware facilities, e.g., CPU, I/O, and memory.
When virtualization is properly done, the guest operating systems are unaware that they are interacting with a virtual machine instead of directly with the hardware. For example, low level disk operations invoked by the operating systems, interaction with system timers, interrupts and exception handling are all managed transparently for the guest operating systems by the virtual machine monitor. To accomplish this, it is necessary that the virtual machine monitor be able to trap and execute certain hardware instructions dealing with the state of the processor.
Significantly, the X86/IA32 employs four modes of protected operation, which are conveniently conceptualized as rings of protection, known as protection rings 0-3. Protection ring 0 is the most privileged, and was designed for execution of the operating system kernel. Privileged instructions available only under protection ring 0 include instructions dealing with interrupt handling, and the modification of processor flags and page tables. Typical examples are store instructions for the global descriptor table (SGDT) and interrupt descriptor table (SIDT). Protection rings 1 and 2 were designed for other operating system services, e.g., device drivers. Protection ring 3, the least privileged, was intended for applications, and is also referred to as user mode. If it were possible to trap all of the privileged X86/IA32 instructions in user mode, it would be relatively straightforward for the virtual machine monitor to handle them using ordinary exception-handling techniques. Unfortunately, the X86/IA32 instruction set includes many privileged instructions that cannot be trapped under protection ring 3. Attempts to naively execute privileged instructions under protection ring 3 typically result in a general protection fault.
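The trapping problem described above can be illustrated with a toy model. The following Python sketch is not X86/IA32 code: the instruction names (LOAD_TABLE, STORE_TABLE, ADD), the ToyCPU and ToyVMM classes, and the table addresses are all hypothetical and chosen only for illustration. It assumes one instruction that faults outside ring 0, so that a monitor can intercept and emulate it, and one sensitive instruction that executes silently in ring 3, so that the monitor never sees it.

```python
class ProtectionFault(Exception):
    """Raised by the toy CPU when a trapping instruction runs outside ring 0."""


class ToyCPU:
    """Hypothetical CPU with one trapping and one non-trapping sensitive opcode."""

    def __init__(self):
        self.ring = 3              # guest code is deprivileged to ring 3
        self.table_base = 0x1000   # real table register, owned by the monitor
        self.acc = 0

    def execute(self, op, arg=None):
        if op == "ADD":            # innocuous: no privilege needed
            self.acc += arg
            return self.acc
        if op == "LOAD_TABLE":     # privileged and trapping
            if self.ring != 0:
                raise ProtectionFault(op)
            self.table_base = arg
            return None
        if op == "STORE_TABLE":    # sensitive but NOT trapping
            return self.table_base # exposes the real value even in ring 3
        raise ValueError(f"unknown instruction {op!r}")


class ToyVMM:
    """Hypothetical monitor that emulates whatever the CPU lets it trap."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_table_base = 0   # what the guest believes it loaded
        self.trapped = []

    def run_guest(self, op, arg=None):
        try:
            return self.cpu.execute(op, arg)
        except ProtectionFault:
            # The fault gives the monitor a chance to emulate the instruction
            # against the guest's virtual state instead of the real hardware.
            self.trapped.append(op)
            if op == "LOAD_TABLE":
                self.virtual_table_base = arg
            return None


vmm = ToyVMM(ToyCPU())
vmm.run_guest("LOAD_TABLE", 0x8000)   # traps; the monitor records 0x8000
seen = vmm.run_guest("STORE_TABLE")   # does not trap; guest reads the real value
```

In this model the guest loads 0x8000 but subsequently reads back 0x1000, because the store never reaches the monitor. A mismatch of this kind is what defeats a purely trap-based monitor.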
Because of the importance of the X86/IA32 architecture, considerable effort has been devoted to overcoming its limitations with regard to virtualization. Virtual machines have been proposed to be implemented by software emulation of at least the privileged instructions of the X86/IA32 instruction set. Alternatively, binary translation techniques can be utilized in the emulator. Binary translation techniques in connection with a virtual machine monitor are disclosed in U.S. Pat. No. 6,397,242, the disclosure of which is incorporated herein by reference. Additionally or alternatively, combinations of direct execution and binary translation can be implemented. The open source Bochs IA-32 Emulator, downloadable via the Internet at the URL http://bochs.sourceforge.net/, is an example of a complete emulator. Another example is the SimOS environment, available via the Internet at the URL http://simos.stanford.edu/. The SimOS environment is adapted to the MIPS R4000 and R10000 and Digital Alpha processor families. Generally, the performance of emulators is relatively slow.
Another known approach employs a hosted architecture. A virtual machine application uses a VM driver to load a virtual machine monitor at a privileged level. Typical of this approach are the disclosures of U.S. Pat. Nos. 6,075,938 and 6,496,847, which are incorporated herein by reference. The virtual machine monitor then uses the I/O services of a host operating system to accommodate user-level VM applications. Current examples of this approach include the VMware Workstation™ and the VMware GSX Server™, both available from VMware, Inc., 3145 Porter Drive, Palo Alto, Calif. 94304, and the Connectix Virtual PC™, available from Microsoft Corporation, One Microsoft Way, Redmond, Wash. 98052-6399. Another example is the open source Plex86 Virtual Machine, available via the Internet. The hosted architecture is attractive due to its simplicity. However, it incurs a performance penalty because the virtual machine monitor must itself run as a scheduled application under the host operating system, and could even be swapped out. Furthermore, it requires emulators to be written and maintained for diverse I/O devices that are invoked by the virtual machine monitor.
It is known in the art to use multiple processors in a single computer in order to enhance overall system performance. One known architecture is symmetric multiprocessing (SMP), in which application programs are processed by multiple processors that share a common operating system and memory. Typically, the processors share memory and the I/O bus or data path, and are controlled by a single instance of an operating system. In order to enhance performance, SMP systems may employ non-uniform memory access (NUMA), a method of configuring the microprocessors so that they can share memory locally.
In a variation of multiprocessing systems, multiple relatively small computers, either uniprocessors or multiprocessors having relatively few processors, are linked together and coordinated to execute multiple applications, while serving one or more users. This arrangement is known as a cluster, or scaled-out arrangement. Some systems of this type can outperform corresponding SMP configurations. However, in the past it has been necessary that applications for cluster-based systems be specialized, so that they are cluster-aware. This has increased development expense, and in some cases, has impeded the use of standard commercial software on cluster-based systems.
An unsuccessful attempt to implement a VM computing paradigm on cluster-based systems is disclosed in the document The Memory and Communication Subsystem of Virtual Machines for Cluster Computing, Shiliang Hu and Xidong Wang, January 2002 (Hu et al.), published on the Internet. In this proposed arrangement, multiple SMP clusters of NUMA-like processors are monitored by virtual machine monitors. A cluster interconnect deals with message passing among the clusters. The system consists of multiple virtual machines that operate under a single operating system, and support parallel programming models. While a virtual computer built according to this paradigm would initially appear to be highly scalable, preliminary simulations of the communication and memory subsystems were discouraging. A further difficulty is posed by limitations of current operating systems, which are generally unaware of the locality of NUMA-type memory. According to Hu et al., the proposed paradigm could not be reduced to practice until substantial technological changes occur in the industry. Thus Hu et al. appears to have encountered a well-known difficulty: cluster machines generally, and NUMA machines in particular, can be scaled up successfully only if some way is found to ensure a high computation to communication ratio in regard to both data distribution and explicit communication among the clusters and processors.
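The importance of the computation to communication ratio can be made concrete with a deliberately simple cost model. The sketch below is an illustration under stated assumptions, not a result taken from Hu et al.: it supposes that computation divides evenly among the nodes while each node pays a fixed communication cost, and the function names and sample figures are hypothetical.

```python
def computation_to_communication_ratio(compute_seconds, comm_seconds):
    """Ratio of time spent computing to time spent communicating on one node."""
    if comm_seconds <= 0:
        raise ValueError("communication time must be positive")
    return compute_seconds / comm_seconds


def estimated_speedup(total_compute_seconds, comm_seconds_per_node, nodes):
    """Speedup over one node, assuming perfectly divisible computation.

    Each node performs total_compute_seconds / nodes of work and additionally
    spends comm_seconds_per_node exchanging data with the other nodes.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    parallel_time = total_compute_seconds / nodes + comm_seconds_per_node
    return total_compute_seconds / parallel_time


# Hypothetical workload: 100 seconds of computation spread over 10 nodes.
high_ratio = computation_to_communication_ratio(100.0 / 10, 1.0)
low_ratio = computation_to_communication_ratio(100.0 / 10, 10.0)
speedup_high = estimated_speedup(100.0, 1.0, 10)
speedup_low = estimated_speedup(100.0, 10.0, 10)
```

Under these assumptions, a per-node ratio of 10 yields a speedup of roughly 9 on 10 nodes, whereas a ratio of 1 yields a speedup of only 5. Adding nodes lowers the ratio further, since the computation per node shrinks while the communication cost does not.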
The most successful of the solutions noted above have either relied upon revisions and optimizations of the underlying computer hardware, as in the case of the IBM z/VM product, in order to overcome the issues encountered by Hu et al. and to increase performance generally, or have required kernel modifications of operating system software, as in the case of the above-noted VMware products. These approaches are costly in terms of product development, marketing, and maintenance, and are often commercially impracticable, due to secrecy policies of operating system software vendors.