For a server that runs more than one guest operating system (OS) or virtual machine (VM), a hypervisor can abstract access to an external Ethernet network by implementing an emulated network interface card (NIC) or virtual NIC (vNIC) and presenting it as a regular NIC to each guest OS. The hypervisor can talk to a physical NIC (pNIC) and “translate” access from the vNIC to the pNIC. In the process, the hypervisor can add value such as filtering, rate limiting and access control. Conceptually, the pNIC can be viewed as an uplink port to the hypervisor and to the physical Ethernet network. The multiple guest OSs, which previously could have been running on separate physical machines (with or without a hypervisor), could in theory communicate with each other through an external Ethernet switch. In practice they cannot, because existing Ethernet switches, in order to prevent routing loops, do not loop a packet back onto the same port on which it arrived.
FIG. 1 illustrates an exemplary conventional peripheral component interconnect (PCI) or PCI express (PCIe) hypervisor system 100. A server 102 may contain a central processing unit (CPU) 104 with one or more cores, the CPU coupled to memory 106 such as dynamic random access memory (DRAM). A pNIC 108 allows the server 102 to communicate with a network through a port 110 on a switch such as an Ethernet switch 112. The switch 112 may be coupled to storage devices 114 and other servers 116 (with their own OS) through other ports on the switch. The server 102 may be able to implement one or more guest OSs 118 (any combination of Windows, Linux or other OSs), each running one or more applications. To share the resources of the CPU 104, a hypervisor 120 abstracts the underlying hardware from the guest OSs, and essentially time-shares the CPU among the guest OSs. vNICs 122 allow the guest OSs 118 to interface with the hypervisor 120 and ultimately the pNIC 108.
If any of the applications running on the guest OSs 118 wants to communicate with a device in the network, packets can be routed through the vNIC 122 of the guest OS, through the pNIC 108, and out to the switch 112 for routing through port 110. On the other hand, if one application wants to communicate with another application running in the same server 102, by definition the guest OSs 118 must still communicate with each other through the normal networking stack. However, because the guest OSs 118 share a common pNIC 108, and because network switches 112 do not allow packets to be looped back onto the same port 110, the hypervisor 120 cannot rely on the network switch to perform the necessary switching.
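The same-port restriction described above can be illustrated with a minimal sketch of a conventional learning switch. The class name, the port indices 0 to 3, and the MAC addresses are hypothetical and chosen only for illustration; port 0 stands in for the switch port to which the shared pNIC is attached.

```python
from typing import Dict, List


class LearningSwitch:
    """Toy model of a conventional Ethernet switch (hypothetical, for illustration)."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        # MAC address -> port on which that address was last seen as a source
        self.mac_table: Dict[str, int] = {}

    def forward(self, ingress_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the egress ports for a frame arriving on ingress_port."""
        # Learn where the sender lives.
        self.mac_table[src_mac] = ingress_port
        egress = self.mac_table.get(dst_mac)
        if egress is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in range(self.num_ports) if p != ingress_port]
        if egress == ingress_port:
            # Destination sits behind the port the frame came in on:
            # the frame is filtered rather than looped back.
            return []
        return [egress]


# Assumed example addresses: two guest OSs behind port 0, a storage device on port 1.
GUEST_A = "02:00:00:00:00:0a"
GUEST_B = "02:00:00:00:00:0b"
STORAGE = "02:00:00:00:00:10"

switch = LearningSwitch(num_ports=4)
switch.forward(0, GUEST_A, STORAGE)                    # learns GUEST_A on port 0
switch.forward(0, GUEST_B, STORAGE)                    # learns GUEST_B on port 0
storage_to_guest = switch.forward(1, STORAGE, GUEST_A)
guest_to_guest = switch.forward(0, GUEST_A, GUEST_B)
```

In this sketch, traffic from the storage device reaches the guest through port 0, while the guest-to-guest frame yields an empty egress list because both guests were learned on the same port.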
Because the hypervisor 120 cannot rely on the switch port 110 to do the switching, a virtual switch (vSwitch) 122 can be employed in the hypervisor to connect the vNICs together and perform switching between them. The vSwitch 122 can implement the routing function and route packets from one application to another without needing to involve the pNIC 108 or a network switch 112.
While this approach works well and is scalable to any number of guest OSs (because the vSwitch is essentially software), a CPU utilization penalty is paid for the memory copies and hypervisor intervention required on every input/output (I/O) operation. To route data, data residing in the virtual memory space assigned to the source OS must be copied to the virtual memory space assigned to the destination OS. Media access control (MAC) addresses in the request to transfer data uniquely identify the network adapters of the source and destination virtual machines. Because the CPU must be involved in all network traffic, CPU utilization suffers: cycles consumed by switching are unavailable for running the guest OSs. Memory bandwidth is also wasted by the copying step.
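The copy-based switching described above can be sketched as follows. This is a simplified model under stated assumptions, not an actual hypervisor implementation: the `VSwitch` class, its receive queues, and the `bytes_copied` counter are hypothetical names introduced only to make the per-packet copy cost visible.

```python
from typing import Dict, List


class VSwitch:
    """Toy software switch that copies payloads between guest buffers (hypothetical)."""

    def __init__(self) -> None:
        # Destination MAC -> receive queue in the destination guest's memory space
        self.rx_queues: Dict[str, List[bytes]] = {}
        # Running total of bytes the CPU had to copy
        self.bytes_copied = 0

    def attach_vnic(self, mac: str) -> None:
        """Register a vNIC, identified by its MAC address, with the switch."""
        self.rx_queues[mac] = []

    def send(self, src_mac: str, dst_mac: str, payload: bytearray) -> bool:
        """Deliver payload to a local vNIC; return False if dst_mac is not local."""
        queue = self.rx_queues.get(dst_mac)
        if queue is None:
            # Not a local guest; a real hypervisor would hand the frame to the pNIC.
            return False
        copied = bytes(payload)  # CPU copy from source buffer to destination buffer
        queue.append(copied)
        self.bytes_copied += len(copied)
        return True


# Assumed example addresses for two guests on the same server.
GUEST_A = "02:00:00:00:00:0a"
GUEST_B = "02:00:00:00:00:0b"

vswitch = VSwitch()
vswitch.attach_vnic(GUEST_A)
vswitch.attach_vnic(GUEST_B)

source_buffer = bytearray(b"x" * 1500)
delivered = vswitch.send(GUEST_A, GUEST_B, source_buffer)
source_buffer[0] = ord("y")  # later changes to the source do not affect the copy
```

Every delivered frame increases `bytes_copied` by its full length, which is the CPU and memory bandwidth cost the paragraph above refers to.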
FIG. 2 illustrates an exemplary PCIe system 200 with multi-root I/O virtualization (IOV). In FIG. 2, a blade server 202 may have blades 0-15 (identified with reference character 204 in FIG. 2), a shared PCIe switch blade 206, and a shared I/O blade including a pNIC 208. The pNIC may be capable of connecting to an external switch 210 through 10 Gb Ethernet ports 212, for example. The pNIC 208 can be coupled through a PCIe interface 214 to the shared PCIe switch, which can then be connected to each of the blades 204 through additional PCIe interfaces 216. Each blade 204 may contain the usual server components such as a CPU, memory such as DRAM, and storage. Each blade may also include a hypervisor 218 for running multiple VMs or guest OSs 220 in each blade. However, in a bladed (multi-root) environment with a shared I/O module (shared pNIC 208 connected to a single Ethernet port as shown in FIG. 2), because a hypervisor vSwitch 222 does not span multiple blades 204, guest OSs 220 on different blades have no means of communicating with each other (while preserving the traditional networking stack).
Shared PCIe switch 206 is where switching between blades and even within blades can occur. Note that the PCIe switch 206 does not contain Ethernet data processing capabilities, and therefore the PCIe switch by itself is not able to handle Ethernet traffic between the server blades.
FIG. 3 illustrates an exemplary PCIe system with single root (SR) IOV. In the example of FIG. 3, which is a single server scenario running a hypervisor 318, each guest OS 320 gets direct access to I/O through PCIe 316 without hypervisor involvement, which improves CPU utilization because the hypervisor 318 does not trap network traffic. In this I/O pass-through model, where the pNIC 308 is capable of supporting SR IOV and therefore exposes multiple virtual functions (VFs), the hypervisor 318 does not have to act as an intermediary for I/O transactions, and the pNIC 308 is directly exported to the guest OSs as VFs. However, because the hypervisor 318 is not involved, it cannot perform any vSwitch functionality. Therefore, in this embodiment, the switch functionality must reside in either the pNIC 308 or a port in the network switch 310.
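To illustrate what placing the switch functionality in the pNIC could involve, the following minimal sketch models a forwarding decision made per destination MAC address. It is an assumption-laden illustration rather than a description of any particular device: the `SriovNicModel` class, the VF indices, and the `"uplink"` label are all hypothetical.

```python
from typing import Dict

UPLINK = "uplink"  # hypothetical label for the external Ethernet port


class SriovNicModel:
    """Toy model of a pNIC that exposes VFs and classifies frames (hypothetical)."""

    def __init__(self) -> None:
        # MAC address -> index of the virtual function assigned to a guest OS
        self.vf_by_mac: Dict[str, int] = {}

    def assign_vf(self, vf_index: int, mac: str) -> None:
        """Record that the guest using this VF owns the given MAC address."""
        self.vf_by_mac[mac] = vf_index

    def classify(self, dst_mac: str) -> str:
        """Return the local VF that should receive the frame, or the uplink."""
        vf_index = self.vf_by_mac.get(dst_mac)
        if vf_index is None:
            return UPLINK  # destination is not a local guest
        return f"vf{vf_index}"  # destination is a guest on the same server


# Assumed example addresses for two guests and one external device.
nic = SriovNicModel()
nic.assign_vf(0, "02:00:00:00:00:0a")
nic.assign_vf(1, "02:00:00:00:00:0b")

local_target = nic.classify("02:00:00:00:00:0b")
external_target = nic.classify("02:00:00:00:00:10")
```

In this sketch, a frame addressed to another local guest is steered to that guest's VF without hypervisor involvement, and all other frames are sent toward the external switch.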
Hence, the concept of providing switching functionality within a pNIC is desirable to enable these various models.