Virtualization is a way to run multiple virtual machines (VMs) on one or more devices or systems. With virtualization, spare processing power and/or storage on a device may be used more efficiently by sharing it among multiple VMs. Typically, these systems are constructed so that programs running within a VM operate without knowledge that the VM is sharing resources with other VMs. Besides interoperability, virtualization must also address security concerns. Typically, I/O virtualization solutions provide the same isolation that would be found if each VM were operating on a separate physical machine. Isolation involves separation of memory space, input/output (I/O) streams, and interrupts, and the ability to isolate control operations, I/O operations, and errors.
There are many available computer I/O interconnect standards. One of them is the peripheral component interconnect (PCI) standard. PCI allows the bus to act like a bridge, which isolates a local processor bus from the peripherals, allowing a Central Processing Unit (CPU) of the computer to run faster. A successor to PCI (termed PCI Express or PCIe) provides higher performance, increased flexibility and scalability for next-generation systems, while maintaining software compatibility with existing PCI applications. Compared to legacy PCI, the PCI Express protocol is more complex, with three layers, i.e. the transaction layer, the data link layer and the physical layer.
In a PCI Express system, a root complex device connects a processor and a memory subsystem to a PCIe switch fabric having one or more switch devices. PCI Express uses a point-to-point architecture. Similar to a host bridge in a PCI system, the root complex generates transaction requests on behalf of the processor, which is interconnected through a local I/O interconnect. Root complex functionality may be implemented as a discrete device, or may be integrated with the processor. A root complex may contain more than one PCI Express port, and multiple switch devices may be connected to ports on the root complex or cascaded. FIG. 1 shows an exemplary standard PCIe device 100 having, for example, three different functions, each with its own physical resources, as well as internal routing 103, configuration resources 105, and a PCIe port 107. PCIe functionality shared by all functions is managed through function 0. A PCIe device may typically support up to 8 functions.
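The limit of 8 functions follows from the conventional PCI addressing scheme, in which a function is identified by an 8-bit bus number, a 5-bit device number, and a 3-bit function number. The following is a minimal illustrative sketch in Python and is not part of the system described here; the helper names `encode_bdf` and `decode_bdf` are hypothetical.

```python
# Illustrative sketch of the conventional bus:device.function (BDF) layout.
# The helper names are hypothetical, not taken from any driver or library.
from typing import Tuple

MAX_FUNCTIONS = 8  # a 3-bit function number allows functions 0..7


def encode_bdf(bus: int, device: int, function: int) -> int:
    """Pack bus (8 bits), device (5 bits) and function (3 bits) into 16 bits."""
    if not 0 <= bus < 256:
        raise ValueError("bus must be in 0..255")
    if not 0 <= device < 32:
        raise ValueError("device must be in 0..31")
    if not 0 <= function < MAX_FUNCTIONS:
        raise ValueError("function must be in 0..7")
    return (bus << 8) | (device << 3) | function


def decode_bdf(routing_id: int) -> Tuple[int, int, int]:
    """Unpack a 16-bit identifier into (bus, device, function)."""
    return (routing_id >> 8) & 0xFF, (routing_id >> 3) & 0x1F, routing_id & 0x7
```

For instance, `encode_bdf(3, 0, 7)` addresses the last of the 8 possible functions of device 0 on bus 3, while a function number of 8 is rejected.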
The Single-Root Input/Output Virtualization (SR-IOV) standard was introduced to standardize the sharing of PCIe devices in a way that still meets virtualization goals. SR-IOV provides a mechanism by which a single root function (such as a single Ethernet port) may appear to be multiple separate physical devices. In this manner, a port leading to a PCIe device may be shared among multiple VMs, thus effectively sharing the PCIe device among the VMs without any VM needing to be aware of the existence of the others. An SR-IOV-capable device (such as a PCIe endpoint) may be configured to appear in the PCI configuration space as multiple functions, each with its own configuration space complete with Base Address Registers (BARs). A VM manager (VMM) assigns one or more virtual functions to a VM by mapping the actual configuration space of the virtual functions to the configuration space presented to the VM by the VMM.
SR-IOV introduces the concepts of physical functions and virtual functions. A physical function is a PCIe function that supports the SR-IOV capability. A virtual function is a lightweight function that is associated with a physical function but that may be assigned to a particular VM. In other words, one or more virtual functions may be assigned to one VM. All of this capability is managed through the VMM in coordination with the component in the hypervisor that manages the SR-IOV virtual functions. FIG. 2 shows a schematic view of an exemplary PCIe SR-IOV capable device. In FIG. 2, the PCIe SR-IOV capable device 200 has two physical functions, and each physical function (PF) has three virtual functions. In reality, there may be any number of physical functions (up to device limits), and each physical function may have a respective number of associated virtual functions. While SR-IOV allows multiple VMs within a single host to share physical resources, there is no capability to allow VMs across multiple hosts to share physical resources. SR-IOV only allows a single root complex, and thus a single host, to share resources of an attached PCIe device.
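The relationship between physical functions, virtual functions and VMs may be pictured with a small model. The following Python sketch is an illustration only, under the assumption of the FIG. 2 layout (two physical functions with three virtual functions each); the class names `VirtualFunction` and `SriovDevice` and the first-free assignment policy are hypothetical and do not represent any VMM interface.

```python
# Illustrative model of an SR-IOV device: virtual functions (VFs) grouped
# under physical functions (PFs) and handed out to VMs of a single host.
# All names and the first-free policy are assumptions made for this sketch.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VirtualFunction:
    pf_index: int                      # physical function this VF belongs to
    vf_index: int                      # index of the VF within that PF
    assigned_vm: Optional[str] = None  # None while the VF is unassigned


@dataclass
class SriovDevice:
    num_pfs: int
    vfs_per_pf: int
    vfs: List[VirtualFunction] = field(init=False)

    def __post_init__(self) -> None:
        self.vfs = [
            VirtualFunction(pf, vf)
            for pf in range(self.num_pfs)
            for vf in range(self.vfs_per_pf)
        ]

    def assign(self, vm: str) -> VirtualFunction:
        """Give the first unassigned VF to the named VM."""
        for vf in self.vfs:
            if vf.assigned_vm is None:
                vf.assigned_vm = vm
                return vf
        raise RuntimeError("no free virtual function")

    def vfs_of(self, vm: str) -> List[VirtualFunction]:
        """Return every VF currently assigned to the named VM."""
        return [vf for vf in self.vfs if vf.assigned_vm == vm]
```

With `SriovDevice(num_pfs=2, vfs_per_pf=3)`, six virtual functions are available, and one VM may hold several of them. The model has a single owner of the device, which mirrors the single-root limitation noted above.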
With PCIe devices expanding rapidly, it is now more common to have devices, such as switches, connecting multiple hosts to multiple PCIe devices. It would be advantageous to allow these multiple hosts to share PCIe endpoint functions, because it would allow for the PCIe endpoint functions to be dynamically provisioned among the hosts to meet workload requirements. One solution is known as Multi-Root Input/Output Virtualization (MR-IOV). Although this scheme has been standardized, if one were to try to implement it on a new switch, the lack of availability of MR-IOV compatible PCIe endpoints would make such a switch virtually useless.
An existing solution uses Non-Transparent-Bridge (NTB) devices and applies resource redirection methods when multiple hosts are connected using the non-transparent ports of a PCIe switch that supports shared I/O mechanisms. FIG. 3 shows an exemplary schematic view illustrating the sharing of virtual functions with multiple hosts through NTB devices. As seen in FIG. 3, when the multiple hosts are connected using the non-transparent ports of a PCIe transparent switch 301, each NTB device allows multi-root sharing of endpoint functions using the existing SR-IOV standard, which is in use by a large number of devices, thus obtaining the advantages of MR-IOV without needing to actually implement MR-IOV.
FIG. 4 shows an exemplary schematic view illustrating both the physical and virtual hierarchies for a single host's sharing of a plurality of SR-IOV endpoints. In FIG. 4, the physical structures include a non-transparent port 400 connected to a host 402, a transparent PCI-to-PCI bridge 404 of the upstream port, and the global space/management hierarchy 406 where the SR-IOV endpoints connect. For each of downstream ports 408, 410 and 412 in the management hierarchy 406 that connects to one of shared endpoints 414, 416 and 418, there is a corresponding emulated virtual PCI-to-PCI (P-P) bridge 420, 422 and 424, respectively. The registers of the emulated virtual PCI-to-PCI bridges are located in a memory 426 of a management processor 428. These registers are accessed by redirecting control and status register (CSR) requests to the management processor 428.
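The redirection of CSR requests may be pictured as follows. This Python sketch is a simplified illustration and not the implementation of FIG. 4: the class names, the `csr_request` method, the register offset and the stored value are all hypothetical, and only the reference numerals 420, 422 and 424 are taken from the figure description.

```python
# Illustrative sketch: CSR requests aimed at emulated virtual P-P bridges are
# redirected to registers held in the management processor's memory.
# Class names, method names, offsets and values are assumptions.
from typing import Dict, Iterable, Optional


class ManagementProcessor:
    """Holds the emulated bridge registers in its own memory."""

    def __init__(self, bridge_ids: Iterable[int]) -> None:
        # One register file (offset -> value) per emulated bridge.
        self.memory: Dict[int, Dict[int, int]] = {b: {} for b in bridge_ids}

    def read(self, bridge_id: int, offset: int) -> int:
        return self.memory[bridge_id].get(offset, 0)  # unwritten reads as 0

    def write(self, bridge_id: int, offset: int, value: int) -> None:
        self.memory[bridge_id][offset] = value


class NonTransparentPort:
    """Forwards CSR requests instead of answering them from hardware."""

    def __init__(self, processor: ManagementProcessor) -> None:
        self.processor = processor

    def csr_request(self, bridge_id: int, offset: int,
                    value: Optional[int] = None) -> int:
        if value is None:  # read request
            return self.processor.read(bridge_id, offset)
        self.processor.write(bridge_id, offset, value)  # write request
        return value
```

A host that writes a register of bridge 420 and then reads it back receives the value stored in the management processor's memory, while bridges 422 and 424 keep independent register contents.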
Normally, a data move operation between applications, kernels, drivers and device hardware may proceed as follows. For example, when data is moved from an application to a device, a driver in the kernel space or the application may allocate a data buffer when it needs to perform a data move operation. The application may get an address for the data buffer from the driver, and move the data to the data buffer when the data is ready. The driver then triggers, for example, a DMA operation of the hardware by putting the address of the data buffer into a DMA transmitter/receiver descriptor in the device. The device may then issue the DMA operation and get the data. For the other direction, the driver allocates the data buffer and puts its address into the DMA transmitter/receiver descriptor when data is incoming. The device writes the data into the data buffer once the data is ready. The application then gets the data from the data buffer through the address of the data buffer, which comes from the driver. This scheme is called zero-copy because there is no other data copy between the application and the device. In current architectures of memory usage by the shared device driver for a DMA operation between the application and the shared device, the NTB mapped buffer is fixed while the data buffer allocated by the driver or the application could be anywhere in the RAM, so another data copy from the data buffer to the NTB mapped buffer is needed.
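The difference between the zero-copy path and the path through a fixed NTB mapped buffer may be sketched by counting copies. The following Python fragment is a rough simulation only, with ordinary byte arrays standing in for the data buffer and the NTB mapped buffer; the function names `send_zero_copy` and `send_via_ntb` are hypothetical, and no real DMA is performed.

```python
# Illustrative simulation: count the extra copies needed before the device
# can fetch the data. Byte arrays stand in for RAM buffers; names are assumed.
from typing import Tuple


def send_zero_copy(payload: bytes) -> Tuple[bytes, int]:
    """Local device: the descriptor points straight at the data buffer."""
    data_buffer = bytearray(payload)  # application fills the data buffer
    descriptor = data_buffer          # driver records the buffer address
    extra_copies = 0                  # device reads the buffer directly
    return bytes(descriptor), extra_copies


def send_via_ntb(payload: bytes, ntb_buffer: bytearray) -> Tuple[bytes, int]:
    """Shared device: data must first be moved into the fixed NTB buffer."""
    if len(payload) > len(ntb_buffer):
        raise ValueError("payload does not fit in the NTB mapped buffer")
    data_buffer = bytearray(payload)  # buffer may sit anywhere in RAM
    ntb_buffer[:len(data_buffer)] = data_buffer  # the additional copy
    extra_copies = 1
    return bytes(ntb_buffer[:len(data_buffer)]), extra_copies
```

Both functions deliver the same bytes to the simulated device, but the second one reports one additional copy, which corresponds to the overhead described above.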
Normally, one SR-IOV device may support, for example, more than 100 virtual functions, but the virtual machines in one server use only 20% or 30% of them. Most of the virtual functions are therefore wasted. To use the SR-IOV virtual functions more efficiently, it is desirable that multiple hosts and virtual machines share the virtual functions of SR-IOV devices in a more intuitive, secure and transparent way, just as if the virtual functions were really plugged into each host.