In a computer system, the term virtualization means hiding an implementation of components or abstracting details. In a rudimentary computing environment of a single tasking, one single program (or task) controls the entire computer system.
With the advent of multi-tasking computers, an operating system (OS) facilitated the efficient sharing of hardware devices across multiple tasks. An OS primarily provides functionalities for process management, memory management, and device management. For process management, an OS runs one task at a time on a central processing unit (CPU) for a predetermined quantum of time until the task is preempted by the OS to relinquish the CPU time for another task at the end of the predetermined time. For memory management, regardless of the size of the physical memory available on the computer system, the OS allows each task to have the full addressable range of memory, so that each task can address the entire address space. The mapping of physical addresses to virtual addresses of a given task is controlled by a kernel of the OS through a mechanism referred to as demand paging. During the execution of a task, all references to the code and data locations are given with respect to their virtual addresses. In early computer architectures, the translation of virtual address to physical address was performed in software, therefore the translation was painstakingly slow.
To overcome the performance limitation of software virtual address translation, processors (e.g., INTEL® i386) started to use hardware page tables to transparently perform the translation between virtual addresses and physical addresses. To facilitate faster context switching between the user process and the OS kernel via system calls, many OS'es such as Linux started to map the kernel virtual address space into the address space of the task itself. For instance, in 32-bit Linux, three-fourths of memory (0x00000000 through 0xbfffffff) is assigned for the user address space and one-fourth of memory (0xc0000000 through 0xffffffff) is assigned for the kernel address space.
The OS permits that each task has an exclusive control over CPU and memory during the time slice of its execution. However, for other devices such as graphics processing unit (GPU), storage devices, network interface card (NIC), the OS directly manages these devices and exercises discretion to ensure the appropriate use of these devices. For example, some devices may need to be exclusively used by one task (e.g., a printer), while others may be concurrently shared among multiple tasks. Some device operations need to be performed atomically while others may be interleaved.
S/360 system by IBM® launched in 1964 was the first attempt of system virtualization of the physical computer. System virtualization makes multiple instances of guest OS'es to run on the same hardware by a supervisory software layer called a hypervisor or a virtual machine manager (VMM). The hypervisor or VMM is interchangeably referred to as a host. Original system virtualization ran the OS in a de-privileged mode (i.e., non-supervisor mode). Based on their mode of deployment, hypervisors are classified into two types. Type 1 hypervisor boots directly on the bare metal (like a classical OS) and brings up the guest OS'es on top of the hypervisor layer. Examples of type 1 hypervisor include, but are not limited to, VMWARE® ESX hypervisor and XEN® hypervisor. Type 2 hypervisor, also referred to as a hosted hypervisor, runs inside a host OS that boots on the bare metal but the actual hypervisor is a user-mode component. Examples of type 2 hypervisor include, but are not limited to, VMWARE® Desktop, and Kernel Virtual Machine (KVM), and FreeBSD BHyVe.
During the early days of system virtualization, compute virtualization, i.e., virtualization of CPU and memory, posed technical challenges. For CPU virtualization of INTEL®/AMD® x86, when an OS runs in a de-privileged level, some sensitive instructions behave differently in the lower privilege levels without faulting. If such instructions had faulted (as happens in “trap-and-emulate” processor architectures), the hypervisor or host would get the opportunity to control and fix the anomaly. For example, if the OS runs in a lower privilege (e.g., Ring 1) than the designated privilege level (e.g., Ring 0), the processor simply executes these sensitive x86 instructions in Ring 1 with different semantics instead of faulting. Dynamic translation and OS paravirtualization techniques were devised to deal with such sensitive instructions. Later processor manufacturers (e.g., INTEL®, AMD®)) came up with efficient hardware architectures to handle CPU virtualization, for example, INTEL® virtualization technology (VT) and AMD-V, wherein such sensitive instructions raise a special trap giving control to the hypervisor to enforce the correct semantics of these instructions.
For memory virtualization, a guest virtual address that is translated to a guest physical address requires an additional level of translation to access physical memory destination of the host. Efficient hardware architectures such as INTEL® extended page table (EPT) and AMD® nested page table (NPT) address memory virtualization by providing hardware support for translating guest virtual address directly to host physical address.
After the compute virtualization was harnessed with efficient hardware architecture, the focus of the computing industry shifted to I/O virtualization. I/O virtualization involves virtualization of devices such as GPUs, storage devices, and NICs. Depending on the deployment type for system virtualization, there are three tiers of I/O virtualizations.
Tier 1 I/O virtualization is connectivity virtualization. Tier 1 I/O virtualization focuses on the optimization of the data center floor to improve the efficiency of physical connectivity, cabling, routing/switching, power distribution etc. For example, XSIGO® data center fabric minimizes physical connections across servers and provides a high speed and low-latency interconnect among servers.
Tier 2 I/O virtualization is hardware device virtualization. Tier 2 I/O virtualization focuses on making multiple virtual hardware endpoints available for use across multiple physical servers. Peripheral Component Interconnect Special Interest Group (PCI-SIG) defines standards for single root I/O virtualization (SR-IOV) and multi root I/O virtualization (MR-IOV). Both SR-IOV and MR-IOV aim at making single physical devices such as a GPU or NIC behave as if they are composed of multiple logical devices. Each of the multiple logical devices of a physical device, referred to as virtual functions (VFs), appears to OS'es as a virtual device such as an individual GPU or NIC. Each VF is exclusively assigned to a guest OS. Tier 2 I/O virtualization also involves PCI Express (PCIe) virtualization, for example, VirtenSys and Aprius. VirtenSys extends PCIe bus outside a group of servers to a switch from which PCIe-connected peripherals such as Ethernet NICs and fiber channel HBAs are shared by the servers, avoiding each of them requiring their own NIC and HBA. Aprius allows servers to share peripheral devices at PCIe bus speeds over a virtual PCIe bus network.
Tier 3 I/O virtualization is software device virtualization that runs inside the server boxes based on hypervisors or VMMs. Tier 3 I/O virtualization focuses on enhancing the overall scalability and utilization of devices like GPUs, storage devices and NICs. Tier 3 I/O virtualization enables concurrent use of I/O devices by multiple guest OS'es.
Initially, tier 3 I/O virtualization used to emulate hardware devices in software. A virtual device driver that is loaded into a guest OS emulates device operations in software by communicating with a software layer in the host (e.g., a hypervisor). The virtual device driver cooperates with the native device drivers of the host to perform the I/O operations. Software device virtualization is generally slow because virtual device drivers are not designed to exploit device-specific optimization (e.g., hardware acceleration). However, software emulation provides good platform coverage because no specific knowledge of the hardware device is required.
The next advancement in tier 3 I/O virtualization was device paravirtualization. Device paravirtualization employs a split-driver architecture by providing a front-end driver in the guest OS and a back-end driver in the hypervisor or host. The back-end driver, also referred to as a VMM driver interface, works with the native device driver of the host or hypervisor. Paravirtualized drivers can be generic (e.g., class drivers such as network, block drivers) or device-specific. When paravirtualized drivers have device-specific intelligence, they permit guest OS'es to exploit hardware acceleration available in the actual hardware device. Thus, paravirtualization enables concurrent access to a hardware device yet providing close to native performance. To achieve best performance, device-specific paravirtualization requires each device manufacturer to write paravirtualized split-drivers for each device/OS/hypervisor combination. Due to the requirements for paravirtualized split-drivers and prohibitive development and sustenance costs, manufacturers slowly distanced away from device paravirtualization as a solution for software device virtualization. However, because hardware device virtualization (e.g., SR-IOV) drivers require guest-host collaboration with high amount of device-specific intelligence to perform operations such as coordinating power management of devices, split-drivers of paravirtualization still remains a viable solution for I/O virtualization.
The next advancement in tier 3 I/O virtualization was direct device assignment. INTEL and AMD added hardware support for device virtualization. INTEL® VT for directed I/O (VT-d) and AMD's I/O memory management unit (IOMMU) allow a single guest OS instance to exclusively own a device (e.g., a GPU, a storage device, a NIC) while none of the other guests or even the host would be able to use the device while the device is in use. The guest OS may use a native device driver to control the device while VT-d and IOMMU took care of performance issues in software device virtualization such as DMA redirection and interrupt redirection. This allows for a single guest OS to achieve close to native performance for the device, but the exclusive ownership of the device hindered the acceptance of the direct device assignment by the virtualization community. For this reason, direct device assignment is also referred to as a “fixed pass through.”
VMWARE®-mediated pass through is a specialized case of direct device assignment (or fixed pass through) that exploits internal architecture details of devices. For example, GPUs support multiple independent contexts and mediated pass-through proposes dedicating just a context, or set of contexts, to a virtual machine (VM) rather than the entire GPU. This enables multiplexing but incurs additional costs. The GPU hardware must implement contexts in a way that they can be mapped to different virtual machines with a low overhead and the host/hypervisor must have enough knowledge of the hardware to allocate and manage GPU contexts. In addition, if each context does not appear as a full logical device, the guest device drivers must be able to handle it. Mediated pass-through lacks interposition features beyond basic isolation. A number of tactics using paravirtualization or standardization of a subset of hardware interfaces can potentially unlock these additional interposition features. For example, the publication entitled “TA2644: Networking I/O Virtualization,” VMworld 2008 by Howie Xu, et al. contemplated analogous techniques for networking hardware.
PCI-SIG provides single root I/O virtualization (SR-IOV) that allows device manufacturers to create a single physical device that behave like multiple devices. An SR-IOV device has a single physical function (or physical device) controlled by the hypervisor or VMM, and multiple virtual functions (or virtual devices) each of which can be assigned exclusively to a guest OS. In the case of direct device assignment, VT-d or IOMMU assumes the responsibility for DMA and interrupt redirection. SR-IOV provides better concurrency in the use of the device but still restricted by the finite number of virtual functions that could be accommodated on the hardware device. SR-IOV is gradually gaining adoption in the virtualization community although data centers have to go through extensive infrastructure changes to benefit from SR-IOV.
Nokia contemplated tier 3 device virtualization solution using a system call bridge in United States Patent Application No. 2013/0072260 entitled “Method and Apparatus for Facilitating Sharing Device Connections.” The system call bridge is built on the assumption that if a guest OS were to remotely make system calls to the host OS (with appropriate translations in the case of heterogeneous OS'es), host devices could be transparently shared on the guest OS'es. This is a process referred to as system call virtualization. However, system call virtualization that remotes only device operations is impractical or undesirable because the process execution, memory management, information and device management, in that case, will be entirely performed by the host OS. Devirtualization was conceived as a special case of a system call bridge where the operations on selected device files alone are remotely called by the host OS. For example, United States Patent Application No. 2013/0204924 entitled “Method and Apparatus for Providing Application Level Transparency via Device Devirtualization” describes devirtualization.
Devirtualization popularized paravirtualization by removing the need for one driver per each of device/OS/hypervisor combinations. By removing device-specific knowledge from a paravirtualized driver, a single pair of generic (i.e., front-end and back-end) drivers can be used to virtualize many types of devices (e.g., GPUs, sensors) while facilitating (1) the concurrent use of the device across guest OS'es, resulting in higher scalability and utilization of the device and (2) hardware acceleration offered by the device to be used by guest OS'es, resulting in close to native performance. Devices such as GPUs or sensors that do not require a fast response without high volumes of asynchronous operations or DMA/interrupts greatly benefit from devirtualization. Since the devirtualization drivers are devoid of knowledge of any specific devices, the guest OS is required to redirect the virtual file system (VFS) operations for the devirtualized devices (e.g., Linux file_operations) to the devirtualization client driver that works in tandem with the devirtualization host driver on the virtualization host to operate on host devices through the host native device drivers.
Devirtualization virtualizes devices in shared memory domains (e.g., single computers) as well as distributed memory domains (e.g., across a network of computers). For shared memory domains, devices are shared between guest OS'es running on a hypervisor on a shared memory system, thus it is an intrinsic devirtualization. On the other hand, for distributed memory domains, devices are shared between multiple discrete computers (e.g., between a smartphone and a tablet), thus it is an extrinsic devirtualization. Devirtualization has its own limitations, but most importantly devirtualization fails to provide coherent user space device interfaces (e.g., entries in Linux /dev, /sys, /proc filesystems) because the device-specific knowledge out of these drivers was abstracted in favor of genericity of device virtualization. A technical report entitled “Making I/O Virtualization Easy with Device Files” by Ardalan Amiri Sani, et al., Technical Report 2013 Apr. 13, Rice University, April 2013. describes the limitations of devirtualization.
System virtualization infrastructures (e.g., XEN®, KVM, VMWARE® VMI) provided the efficient communication mechanism for a guest OS to context switch into the host. These are similar to system calls that allow applications context switch into the kernel. Context switches can be achieved by software interrupts or VMCALL. Software Interrupts are similar to the system calls and switch to the appropriate ring level to gain the privilege to perform host operations. INTEL® VT provides the VMCALL instruction for a guest to perform an immediate context switch to the host. In VMCALL instruction, one of the arguments indicates a special function that the guest wants the host to perform on its behalf, and the rest of the arguments are operation-specific.
Address space virtualization achieved a significant performance gain for an intrinsic devirtualization. Address space virtualization provides a hybrid address space (HAS) that includes a single address space for the host kernel and guest user mappings while performing devirtualized system call operations in the host and allowing the host kernel to directly access system call arguments (and other information) via virtual address pointers in the guest user application's memory space.
The use of HAS allows enhanced page sharing across OS domains in hypervisor-based system virtualization. Prior examples of page sharing architecture include, but not limited to, XEN® grant tables and VMWARE® transparent page sharing (TPS). For XEN® grant tables, selected memory mappings are shared across guest OS domains (sometimes with the host) to avoid redundant copies of data dealt by device drivers. For VMWARE® transparent page sharing (TPS), when multiple guest OS instances of the same OS function simultaneously, a large number of pages remain identical. The hypervisor shares the backing physical (copy-on-write) pages in the virtual address space of the different guest OS instances. HAS-based page sharing enables a host kernel to directly access any portions of the guest application memory.
The performance of devices such as GPU, storage and NIC usually limits the user experience on a computer system whether it is a physical computer or a virtual computer running on a hypervisor. Operating systems such as Windows, Linux, MacOS, iOS and Android provide native device drivers as closed-source or binary distributions. Some device manufacturers make available an open-source version of their drivers, but they usually withhold many of intellectual property of their drivers. An efficient software device virtualization architecture works seamlessly and transparently across multiple devices, even when only binary level closed-source drivers are available. In such a case, the software device virtualization architecture particularly precludes any specific knowledge about the devices, or access to sources for the devices drivers to be able to efficiently perform software device virtualization.
Dynamic device virtualization (DDV) aims at enhancing I/O performance of application programs running on virtual machines. DDV uses dynamically generated (e.g., cloned) device-specific virtual device drivers for virtual machines (guest processes/threads) based on observing the execution of the host native drivers. In addition, DDV performs zero-copy (direct) I/O in the execution context of the guest processes/threads, by directly accessing the guest user memory from the host kernel based on various address space virtualization techniques (e.g., hybrid address space, kernel address space partitioning, dynamic translation).
DDV is a software device virtualization technique that allows multiple guest operating systems (OS) to concurrently access hardware devices of a computer such as graphics processing units (GPU), storage, network interface card (NIC). Software device virtualization enhances scalability and utilization of hardware devices without requiring special hardware optimization (e.g., single root I/O virtualization (SR-IOV) from PCI special interest group (SIG)). A device manager of DDV running on a supervisory software layer observes a behavior of a native device driver of a hardware device loaded on the host, and clones one or more virtual device drivers to run in the guest OS context. The virtual device driver directly invokes device driver interface (DDI) interfaces (callbacks) implemented by the native device driver, and performs the device management chores on the host that was originally meant to be performed only by the native device driver. Thus, the native device driver is virtually shared between the host and the guest OS domains. The execution context of the native device driver on the host is virtually extended into each of the guest OS contexts. Although DDV provides transparency between host devices and guest applications, virtual device drivers must be dynamically cloned for each guest operation systems and guest applications.