Packet network communication adapters are a central element in new high-speed, packetized, serial IO bus architectures that are gaining acceptance in the computer industry. In these systems, computing hosts and peripherals are linked together by a switching network, commonly referred to as a switching fabric, taking the place of parallel buses that are used in legacy systems. A number of architectures of this type have been proposed, culminating in the “InfiniBand™” (IB) architecture, which is described in detail in the InfiniBand Architecture Specification, Release 1.0 (October, 2000), which is incorporated herein by reference. This document is available from the InfiniBand Trade Association at www.infinibandta.org. A host connects to a switching fabric (e.g. the IB fabric) via a host channel adapter (HCA).
Computer system virtualization is known. It allows multiple operating systems and processes to share the hardware resources of a host computer. The system virtualization infrastructure provides resource isolation so that each guest operating system (OS) does not realize that it is sharing resources with another OS and does not adversely affect the execution of the other OS. Such system virtualization enables applications including server consolidation, co-located hosting facilities, distributed web services, applications mobility, secure computing platforms and other applications that provide for efficient use of underlying hardware resources. However, the existing, known virtualization is exclusively implemented in software (SW).
FIG. 1a shows schematically a virtualized server 100 comprising a hardware (HW) section IO device 102, guest domains 104, CPU and memory functions 106 and at least one virtual switching element having isolation and load balancing functions (also referred to throughout the description as a “privileged virtual switching element”, exemplarily a “Hypervisor”) 108 interconnected as shown. Hypervisor is described in detail in, for example, the Xen open source implementation available at www.xensource.com, or in the IBM Redbook publication “Advanced Power Virtualization on IBM @server p5 Servers”, Sep. 27, 2004. In prior art before InfiniBand, all IO accesses (both datapath and control) had to go through the Hypervisor SW layer. In FIG. 1a, the thick full lines illustrate datapath access, while the thin full ones illustrate control access. The necessity to go through SW for every IO access adversely affects performance and increases the demands on a CPU. The InfiniBand channel architecture solves the problem only partially, enabling direct IO access on the datapath IO operations (see below).
The IO device may exemplarily be a HCA, to which the description refers henceforth in detail. However, it should be clear that the present invention applies equally well to other IO devices, for example IO devices in architectures such as Ethernet, Fibre Channel, TCP/IP, etc. “HCA” is thus used throughout the description as representing any smart IO device with functionality as described, while Hypervisor is used throughout the description as representing any virtual switching element with functionality as described.
An exemplary HCA is shown in FIG. 1b. It comprises a command queue (CQ) 120 and a plurality (up to 16M) of data queues (DQ) 122. The CQ is used to set policies, i.e. access rights and restrictions on using the HCA resources (queue pairs, translation tables, etc.) to access a network directly from the application. The data queues implement IO channels to access the IO device. The multiple HCA resources (work queues) are used to execute datapath operations and to implement cross-queue isolation and protection. The protection and isolation policy is set by a single control entity (CQ) owned by a privileged driver. There is therefore a single “control interface” to configure resources and multiple data interfaces to use them.
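The division of labor described above, one privileged control interface that sets policy and many data queues that are used directly under that policy, can be illustrated with a minimal sketch. This is a toy model only, not the HCA's actual interface: the class names ToyHCA and DataQueue, the methods command and post, and the fields owner and allowed_regions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataQueue:
    """One IO channel; its policy is written only via the command queue."""
    owner: str                                   # hypothetical: domain allowed to post work
    allowed_regions: set = field(default_factory=set)  # hypothetical: registered buffers


class ToyHCA:
    """Toy model: a single control interface and multiple data interfaces."""

    def __init__(self):
        self.data_queues = {}

    def command(self, caller_is_privileged, dq_id, owner, regions):
        """Control path: stands in for the CQ owned by the privileged driver."""
        if not caller_is_privileged:
            raise PermissionError("only the privileged driver may set policy")
        self.data_queues[dq_id] = DataQueue(owner, set(regions))

    def post(self, dq_id, caller, region):
        """Datapath: posted directly by the owner; the device enforces policy."""
        dq = self.data_queues.get(dq_id)
        if dq is None or dq.owner != caller:
            raise PermissionError("queue not assigned to this caller")
        if region not in dq.allowed_regions:
            raise PermissionError("memory region not registered for this queue")
        return f"{caller} transferred {region} on DQ {dq_id}"
```

In this sketch, a call such as ToyHCA().post(...) succeeds only after a privileged command(...) call has assigned the queue and its memory regions to the caller, which mirrors the idea that policy is configured in one place and enforced on every data queue.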
In SW virtualization as provided by prior art, each OS becomes a “non-trusted” instance and therefore every IO access goes through the privileged entity (e.g. Hypervisor), which validates access rights. There is no direct access of a guest domain to the IO mechanism. As a result, there are multiple (and not one) kernel transitions for IO access on behalf of an application running on a guest (virtualized) server. In addition, Hypervisor implements a “virtual network” in software to enable cross-domain communication and isolation of different domains. The issue here is not only trust, but also resource management and isolation. The guest domain is not aware of its fellow domains using the same resources. Hypervisor allocates resources per domain and (in the absence of HW aids) also enforces this allocation. In order to do the latter, it is involved in every IO operation. When a HCA is used, Hypervisor can set up policies and the HCA HW will enforce them. The datapath operation can also be off-loaded from Hypervisor with a “standard” HCA, but since control operations become too heavy, this solution is inefficient.
FIG. 1c shows a prior art non-virtual server environment, in which a HCA 102 is being accessed directly for both control and datapath operations from a domain (server) 130.
An InfiniBand HCA device driver in a virtualized server is different from a driver in a non-virtualized server. With IB, the datapath accesses can go directly to the IO device, while the control still goes through Hypervisor. InfiniBand defines access to hardware for IO access on the datapath, and isolation and protection are part of its specification. Thus, a standard-compliant InfiniBand HCA enables exporting an IO channel for data transfer to each operating system, off-loading this “virtual network” from Hypervisor to the HCA HW. Since some control operations (e.g. memory registration) are performance-critical, this solution does not solve the entire problem of IO access from a guest domain, and part of the IO access operation has to go in software through Hypervisor. Such a software based approach has two drawbacks:
1. All control operations on behalf of the application in the guest domain must be done in Hypervisor, which incurs kernel transitions; and
2. The InfiniBand SW (driver) running in the guest domain must be changed to run datapath operations directly to the HW, while for control operations it must generate a call to Hypervisor.
The first drawback has performance implications, since the cost of going through multiple kernel transitions is prohibitive. The second drawback prevents the same OS distribution from running on both non-virtualized and virtualized servers.
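The difference between the three access models discussed so far (pure SW virtualization, a standard InfiniBand HCA under a Hypervisor, and the non-virtualized server of FIG. 1c) can be summarized with a small counting sketch. It is a simplified illustration only: the function run_workload, the mode names and the one-trap-per-operation assumption are hypothetical and are not taken from any specification.

```python
from collections import Counter


def run_workload(mode, control_ops, data_ops):
    """Count operations that go through the hypervisor versus directly to HW.

    Assumes, for illustration, that each mediated operation costs one trap.
    """
    via_hypervisor = {
        "software": {"control", "data"},  # every IO access is validated in SW
        "infiniband": {"control"},        # datapath direct, control mediated
        "non_virtualized": set(),         # FIG. 1c: direct access for both
    }[mode]
    counts = Counter()
    for kind, n in (("control", control_ops), ("data", data_ops)):
        target = "hypervisor" if kind in via_hypervisor else "direct"
        counts[target] += n
    return counts
```

Under these assumptions, a workload with a few control operations and many data operations shows why the InfiniBand model removes most, but not all, of the Hypervisor involvement: the control operations, such as memory registration, still go through software.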
There is therefore a widely recognized need for, and it would be highly advantageous to have, a hardware based solution that enables acceleration of IO access operations on a virtual machine, while reducing CPU load. It would furthermore be advantageous to enable a guest domain to execute control (and not only datapath) operations by direct access to hardware.