The Peripheral Component Interconnect (PCI) standard has continued to meet the needs of CPUs and IO (Inputs/Outputs) devices by increasing the performance while maintaining backward compatibility. In 2002, the PCI-SIG (www.pcisig.com) introduced a new physical implementation of PCI, called PCI Express (abbreviated as PCIe hereinafter). PCIe has a signaling rate of 2.5 Gbaud or an effective data rate of 2.0 Gb/s (due to the 8b/10b encoding) per lane. PCIe is scalable (i.e., multiple lanes can be combined to provide x4, x8, x16 and higher bandwidth), and therefore, can deliver the performance required for next-generation 10 Gb Ethernet (10 GbE) and Fibre Channel IO adapters.
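The per-lane arithmetic above can be illustrated with a short sketch. The function names and the list of candidate link widths are assumptions made for illustration only; the signaling rate and the 8b/10b encoding ratio are the figures stated above, and the rates are per direction.

```python
# Per-lane figures taken from the text: 2.5 Gbaud signaling, 8b/10b encoding.
SIGNALING_RATE_GBAUD = 2.5
ENCODING_EFFICIENCY = 8 / 10  # 8 data bits carried in every 10 transmitted bits


def effective_rate_gbps(lanes: int) -> float:
    """Effective data rate in Gb/s (one direction) for a link of `lanes` lanes."""
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    return SIGNALING_RATE_GBAUD * ENCODING_EFFICIENCY * lanes


def lanes_needed(target_gbps: float, widths=(1, 4, 8, 16)) -> int:
    """Smallest width in `widths` (an illustrative list) that meets the target."""
    for width in widths:
        if effective_rate_gbps(width) >= target_gbps:
            return width
    raise ValueError("no listed link width meets the target rate")


if __name__ == "__main__":
    for width in (1, 4, 8, 16):
        print(f"x{width}: {effective_rate_gbps(width):.1f} Gb/s")
    print("width for a 10 Gb/s adapter:", lanes_needed(10.0))
```

Under these assumptions an x4 link carries 8 Gb/s, which falls short of a 10 GbE adapter, so the sketch selects an x8 link.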
PCI Express was originally designed for desktops connecting a root complex (a host CPU with memory) with downstream IO devices, but has since found applications in servers, storage devices, and other communications systems. The base PCIe switching structure of a single root complex has a tree topology, which addresses PCIe endpoints through a bus numbering scheme.
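The bus numbering scheme of a tree topology can be sketched as a depth-first walk. The sketch below is a simplified illustration, not the enumeration procedure of any particular system: the Bridge class and the function names are hypothetical, and each switch port is modeled as a single bridge that records a primary, a secondary, and a subordinate bus number.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bridge:
    """Hypothetical model of one bridge in a single-root tree."""
    name: str
    children: List["Bridge"] = field(default_factory=list)
    primary: Optional[int] = None      # bus on the upstream side
    secondary: Optional[int] = None    # bus directly below the bridge
    subordinate: Optional[int] = None  # highest bus number below the bridge


def enumerate_buses(bridge: Bridge, primary: int, next_free: int) -> int:
    """Assign bus numbers depth-first and return the next unused bus number."""
    bridge.primary = primary
    bridge.secondary = next_free
    next_free += 1
    for child in bridge.children:
        next_free = enumerate_buses(child, bridge.secondary, next_free)
    bridge.subordinate = next_free - 1
    return next_free


def forwards_downstream(bridge: Bridge, bus: int) -> bool:
    """A bridge forwards traffic for buses in [secondary, subordinate]."""
    return bridge.secondary <= bus <= bridge.subordinate


if __name__ == "__main__":
    leaf = Bridge("C")
    root_port = Bridge("root", [Bridge("A", [leaf]), Bridge("B")])
    enumerate_buses(root_port, primary=0, next_free=1)
    for b in (root_port, root_port.children[0], leaf, root_port.children[1]):
        print(b.name, b.primary, b.secondary, b.subordinate)
```

Because every bridge owns one contiguous range of bus numbers, a single range comparison decides whether traffic for a given bus is forwarded downstream. This is what ties the addressing to a tree with one root.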
There has been much progress over the last few years in the virtualization of computation resources and storage. Virtual machine (VM) technology has emerged to provide the ability to run multiple virtual servers on a single physical server while sharing the physical CPU and memory resources of the physical server. VM technology has been driving new CPU architectural development. CPU vendors are now providing CPUs with an increasing number of cores, which are particularly well suited for running multiple virtual machines.
A virtual machine is defined as a software implementation of a machine (computer) that executes programs like a real machine. Virtualization refers to the abstraction of computer resources, and is a technique of hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. CPU power has been doubling every 18 months following Moore's Law. Server virtualization is a way to leverage the exponential growth of CPU power. When a physical server is virtualized, it results in multiple logical servers, with each logical server comprising a virtual machine. A system image is a software component running on the virtual machine. It is called a system image because it can be shut down and can later resume operation in exactly the state in which it was left. A system image is assigned to a specific virtual machine. Since each system image (SI) is associated with a virtual machine, the terms system image and virtual machine are used interchangeably in the following description.
IO capacity has been lagging behind CPU performance. One way to match the IO capacity to the CPU performance growth is to increase the physical size of the server (a large, expensive rack) to accommodate more network interconnections such as Ethernet network interface cards (NICs), InfiniBand host channel adapters (HCAs), and Fibre Channel (FC) host bus adapters (HBAs). Recognizing this situation, chip vendors and the PCI-SIG are developing virtual IO standards that allow multiple operating systems on a given machine to natively share PCIe devices. The concept is to assign multiple virtual machines to a multi-function device having high-speed IOs such as InfiniBand, Fibre Channel or 10 GbE (10 Gigabit Ethernet).
The progress in virtualization of IO connectivity has not been able to keep up with the technological advance of multi-core CPUs. A physical server contains a limited number of physical ports (e.g., Ethernet NICs for LAN access, Fibre Channel HBAs for SAN access). Because server IO connectivity is fixed, the server IO capability cannot be scaled in real time according to demand. An increase in bandwidth requires physical intervention, for example, through manual insertion of NICs or physical replacement of current NICs with ones having higher bandwidth. Even if a sufficient number of physical endpoints is available, this rigid topology leads to system inefficiencies because it is optimized for only one type of application; if the server is re-targeted for other applications, the IO connectivity needs to be re-configured. In addition, physical removal of a NIC causes the existing system state to reset.
Upgrading the network infrastructure by replacing the current IO interface modules with state-of-the-art and more expensive ones generally does not provide system flexibility because the increased IO capacity, if implemented to meet peak traffic for a certain application, will remain underutilized most of the time. Sharing physical IO resources through IO virtualization (IOV) appears to be a good solution for adapting to the increasing use of multi-core processors in servers. IO virtualization allows virtual machines to share expensive high-bandwidth IOs such as 10 Gb Ethernet or 8 Gb Fibre Channel, and hence justifies their deployment.
The PCI-SIG Working Group is developing a new specification that adds IO virtualization capability to PCI Express. The new specification in development defines two levels of IO virtualization: single-root IO virtualization (SR-IOV) and multi-root IO virtualization (MR-IOV). SR-IOV provides a standard mechanism for endpoint devices to advertise their ability to be simultaneously shared among multiple virtual machines running on the same hardware platform (one host CPU). MR-IOV allows sharing of an IO resource between multiple operating systems on multiple hardware platforms (multiple host CPUs).
IO virtualization gives datacenter managers and network administrators a means to use the existing resources more efficiently, e.g., they can allocate more physical endpoints to a virtual machine when it requires additional bandwidth. FIG. 1 shows an SR-IOV topology. Single-root PCI Manager (SR-PCIM) software is added to the server computer system to virtualize and manage system resources. The SR-PCIM software maps each system image to a specific virtual function inside an endpoint. The physical function is equivalent to a native PCIe function with the additional capability of IO virtualization, i.e., it can contain multiple virtual functions. The single-root PCIe switch may comprise multiple PCIe switches coupled in a tree topology, with each switch equivalent to a native PCIe switch. The SR-PCIM software runs on the host, i.e., it utilizes the host CPU resources. The physical function (PF) is a PCIe function (per the PCI Express Base Specification) that supports the SR-IOV capability. A virtual function associated with a physical function (e.g., VF0, VF1 in PF0 or in PF1) must be of the same device type as the physical function.
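The mapping of system images to virtual functions described above can be sketched as a small data model. The sketch is illustrative only: the class and method names are hypothetical and do not come from the SR-IOV specification or from any PCI manager implementation. It shows only the two rules stated above, namely that a virtual function has the same device type as its physical function and that each system image is mapped to one specific virtual function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VirtualFunction:
    name: str
    device_type: str
    assigned_si: Optional[str] = None  # system image mapped to this VF, if any


@dataclass
class PhysicalFunction:
    name: str
    device_type: str
    vfs: List[VirtualFunction] = field(default_factory=list)

    def add_vf(self, vf_name: str) -> VirtualFunction:
        # A VF inherits the device type of its PF, so the types always match.
        vf = VirtualFunction(f"{self.name}.{vf_name}", self.device_type)
        self.vfs.append(vf)
        return vf


class SrPcimSketch:
    """Hypothetical stand-in for the SI-to-VF mapping role of SR-PCIM."""

    def __init__(self, pfs: List[PhysicalFunction]) -> None:
        self.pfs = list(pfs)
        self.si_to_vf: Dict[str, VirtualFunction] = {}

    def assign(self, si: str, device_type: str) -> VirtualFunction:
        if si in self.si_to_vf:
            raise ValueError(f"{si} is already mapped to a virtual function")
        for pf in self.pfs:
            if pf.device_type != device_type:
                continue
            for vf in pf.vfs:
                if vf.assigned_si is None:
                    vf.assigned_si = si
                    self.si_to_vf[si] = vf
                    return vf
        raise LookupError(f"no free {device_type} virtual function for {si}")


if __name__ == "__main__":
    pf0 = PhysicalFunction("PF0", "ethernet")
    pf0.add_vf("VF0")
    pf0.add_vf("VF1")
    manager = SrPcimSketch([pf0])
    print(manager.assign("SI1", "ethernet").name)
    print(manager.assign("SI2", "ethernet").name)
```

In this sketch a third system image requesting an Ethernet function fails with a LookupError, since both virtual functions of PF0 are taken and the manager has no way to draw on another physical device.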
FIG. 2 shows an MR-IOV topology. In order to support the multi-root topology, PCIe switches and IOV devices need to be MR aware (i.e., they are capable of supporting a multi-root system). MR-aware IO adapters and PCIe switches must have additional register sets to support the various root-complex routings, and an MR-aware PCIe switch must contain two or more upstream ports. In contrast to the SR-IOV specification, which does not change the data link or transaction layers specified in the PCI Express Base Specification, the MR-IOV specification requires modifications in the data link layer. There is also a change in the configuration software to configure the switch fabric and the MR-aware endpoints. The MR-PCI Manager can be implemented above a root complex or sideband off the MR-aware switches. The MR-aware PCIe switches can be interconnected in a variety of topologies: star, tree and mesh.
In summary, current IO adapters and current PCIe devices do not have IO virtualization capabilities. They are designed to be controlled by a single device driver in a single operating system (OS) kernel. Hence, a PCIe device provides all its bandwidth to a single OS running on the physical CPU. Current VM software does not allow operating systems to access IO devices directly, so all IO operations are handled by a VM Manager (VMM) or hypervisor. Server virtualization results in increased IO utilization because virtual machines (system images) need to connect to different data and storage networks. The proposed IOV approaches are handled in software by the PCI Manager, which is responsible for the IO resource assignment and may not be the most efficient solution. In the SR-IOV architecture, multiple system images share a PCIe IOV endpoint. There are two problems with this approach: 1) one physical NIC may be shared by multiple VMs and therefore be overloaded, and the system has no capability to share another NIC dynamically to distribute the load; and 2) NICs, when dynamically added, may not be utilized immediately, and a NIC, when physically removed, causes the existing system to reset. In the case of the MR-IOV architecture, new types of PCIe switches and new types of PCIe endpoint devices need to be deployed. Furthermore, the PCIe endpoint can become the bottleneck in both proposed IOV topologies because the proposed IOV specifications do not support the spreading of virtual functions across multiple physical devices.
FIG. 3 shows the topology of a cluster of system images connected to a PCIe switching cloud. The PCIe switching cloud (also referred to as PCIe switched fabric) comprises a plurality of PCIe switches, which can be conventional such as those shown in FIG. 1 or MR-aware such as those shown in FIG. 2. The PCIe switching cloud is coupled to a plurality of network interface cards. Multiple system images share a GbE NIC to access a local area network (LAN) or an FC HBA to access a storage area network (SAN). The virtual machine monitor or hypervisor provides each system image with a physical media access control (MAC) address to connect the system image to the physical NIC through the PCIe switched cloud. IO devices such as NICs can support multiple virtual functions. To the server computer system, each of these virtual functions appears as a separate PCIe function which can be directly assigned. In the given example, system images SI 1 to SI K are assigned to the GbE NIC 331. The physical NIC 331 supports virtualization and represents K different virtual NICs. If system images 1 to K exceed the bandwidth of the NIC 331, the current system cannot dynamically allocate resources by adding the second GbE NIC 332 to assist the traffic flow. And if data center managers and network managers discover that system image K is the one that generates the most traffic, there is no central management mechanism that allows a reallocation of the system image K to the NIC 332 without affecting the routing setup of the complete system.
Therefore, it is desirable to balance the traffic over the NICs so that no single NIC handles too much traffic (this is referred to as load balancing). One way of implementing load balancing is to use a round-robin approach, where the server sends out a first data packet using a first NIC, a second data packet using a second NIC, and so on. However, the round-robin approach is problematic because multiple data packets are typically associated with a given session (a transaction between a system image and a NIC). When these packets are sent through different NICs, they may arrive at the destination out of order. An alternative approach is to use randomized algorithms which assign packets “randomly” to available NICs. The randomized approach faces the same issue: packets may be received out of order. Yet another approach is the MAC-based approach, where multiple data packets associated with the same session are assigned the same MAC address, but this will lead to traffic congestion on the assigned NIC if the system image has a bandwidth demand that exceeds the NIC capability.
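The trade-off between these approaches can be illustrated with a small sketch. It is not an implementation of any described embodiment. Packets are modeled as hypothetical (session, sequence number) pairs, the function names are invented for illustration, and the MAC-based approach is approximated by pinning each session to one NIC through a CRC-32 hash of the session identifier, which is an assumption standing in for the MAC address assignment.

```python
import zlib
from collections import Counter, defaultdict
from itertools import cycle


def round_robin(packets, nics):
    """Send each packet on the next NIC in turn, ignoring its session."""
    nic_cycle = cycle(nics)
    return [(pkt, next(nic_cycle)) for pkt in packets]


def session_pinned(packets, nics):
    """Keep every packet of a session on one NIC (stand-in for MAC-based)."""
    def pick(session):
        # Hash choice is an assumption; any stable per-session mapping works.
        return nics[zlib.crc32(session.encode()) % len(nics)]
    return [(pkt, pick(pkt[0])) for pkt in packets]


def nics_per_session(assignment):
    """Map each session to the set of NICs that carried its packets."""
    used = defaultdict(set)
    for (session, _seq), nic in assignment:
        used[session].add(nic)
    return dict(used)


def load_per_nic(assignment):
    """Count the packets placed on each NIC."""
    return Counter(nic for _pkt, nic in assignment)


if __name__ == "__main__":
    nics = ["NIC 331", "NIC 332"]
    packets = [("SI-K", seq) for seq in range(6)]  # one heavy session
    for policy in (round_robin, session_pinned):
        result = policy(packets, nics)
        print(policy.__name__, nics_per_session(result), dict(load_per_nic(result)))
```

With a single heavy session, round-robin splits the load evenly but spreads one session over both NICs, which permits reordering. Pinning keeps the session in order but places all of its packets on one NIC, which is the congestion case noted above.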
As the use of virtual machines (VMs) in server environments grows, and as server computer systems use multi-core hosts and multiple hosts, it may be necessary to have a dedicated host running the VM manager to coordinate the configuration of all root complexes, all PCIe switches and all IO adapters and to assign communication bandwidth to system images according to their traffic demand. Embodiments described below provide systems and methods to enable each VM on the server to access underlying physical IO devices coupled to the PCIe switching cloud.