Single Root I/O Virtualization (SR-IOV) is a specification defined by the PCI-SIG industry consortium for improving I/O performance in scenarios where a physical PCIe I/O device is shared among multiple applications or virtual machines (known generally as “I/O virtualization”). With a typical, hypervisor-based approach to I/O virtualization (which does not make use of SR-IOV), a hypervisor emulates the physical I/O device using a virtual I/O device. For example, the virtual I/O device receives, from a guest device driver in each virtual machine (VM), I/O operations requested by the VM and processes the I/O operations before passing them on to the physical I/O device via a host device driver. Conversely, the virtual I/O device receives, from the host device driver, I/O operations requested by the physical I/O device (and destined for a particular VM) and processes the I/O operations before passing them on to the appropriate VM via the VM's guest device driver. While this hypervisor-based approach is functional, it is inefficient because the I/O operations must traverse two I/O stacks (one in the VM and another in the hypervisor), which increases latency. In addition, implementing the virtual I/O device incurs CPU overhead, which can reduce the maximum throughput to and from the physical I/O device (due to, e.g., the additional CPU clock cycles needed to process I/O at the hypervisor level).
SR-IOV overcomes the inefficiencies above by allowing the physical I/O device to directly write data to, and read data from, the guest memory space of each VM sharing the device, thereby bypassing the hypervisor. This eliminates the overhead incurred by the virtual I/O device and enables the system hosting the physical I/O device to achieve a level of I/O performance that is similar to that of non-virtualized scenarios. In practice, a physical I/O device that supports SR-IOV (referred to herein as an “SR-IOV device”) implements multiple, independent virtual functions (VFs), each of which appears on the PCIe bus as a separate instance of the device. These multiple VFs map to a single PCIe physical function (PF) of the physical I/O device. At runtime, the hypervisor assigns one or more VFs to each VM executing on the host system. Each VF then communicates directly with the guest device driver (i.e., “VF driver”) within the VF's assigned VM to enable data movement between VM guest memory and the physical I/O device via direct memory access (DMA), without requiring intermediary processing by a virtual I/O device in the hypervisor. For instance, when an SR-IOV device receives data destined for a particular VM, the VF assigned to that VM uses DMA to directly copy the data to one or more receive (RX) buffers in VM guest memory. The SR-IOV device then posts a hardware interrupt to the hypervisor indicating that the DMA transaction is complete. In response to the hardware interrupt, the hypervisor injects a virtual interrupt into the target VM, thereby signaling to the VF driver in the VM that the data in the RX buffers may be processed.
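The receive path described above can be illustrated with a small Python model. This is only a sketch for exposition: the class names (VfDriver, Hypervisor, VirtualFunction) and the event log are hypothetical and do not correspond to any real driver or hypervisor interface. The point it illustrates is that the payload moves from the VF to the RX buffers without passing through the hypervisor, which only relays the completion interrupt.

```python
from typing import List


class VfDriver:
    """Guest-side VF driver (hypothetical); its RX buffers live in VM guest memory."""

    def __init__(self) -> None:
        self.rx_buffers: List[bytes] = []
        self.processed: List[bytes] = []

    def on_virtual_interrupt(self) -> None:
        # The virtual interrupt signals that the RX buffers may be processed.
        self.processed.extend(self.rx_buffers)
        self.rx_buffers.clear()


class Hypervisor:
    """Relays the completion interrupt only; it never handles the payload."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def on_hardware_interrupt(self, driver: VfDriver) -> None:
        self.log.append("hypervisor: inject virtual interrupt")
        driver.on_virtual_interrupt()


class VirtualFunction:
    """One VF of the SR-IOV device, assigned to a single VM."""

    def __init__(self, driver: VfDriver, hypervisor: Hypervisor, log: List[str]) -> None:
        self.driver = driver
        self.hypervisor = hypervisor
        self.log = log

    def receive(self, payload: bytes) -> None:
        # Step 1: DMA the data straight into the guest's RX buffer.
        self.log.append("vf: DMA write to RX buffer")
        self.driver.rx_buffers.append(payload)
        # Step 2: tell the hypervisor that the DMA transaction is complete.
        self.log.append("device: post hardware interrupt")
        self.hypervisor.on_hardware_interrupt(self.driver)


log: List[str] = []
driver = VfDriver()
vf = VirtualFunction(driver, Hypervisor(log), log)
vf.receive(b"packet-1")
```

After the call, driver.processed holds the packet and the log records the three steps in the order given in the text.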
One limitation of SR-IOV as it exists today is that, due to the manner in which the physical I/O device's VF directly writes data to VM guest memory, SR-IOV is incompatible with certain virtualization features, such as live VM migration (e.g., vMotion). To understand this incompatibility, consider the typical workflow for a live VM migration event. During a long, first phase (known as the “pre-copy” phase), the hypervisor on a source host copies VM memory pages from the source host to a destination host while the VM is running. Since the VM is active during this phase, the hypervisor keeps track of the memory pages that are modified (i.e., dirtied) by the VM as it runs and copies those pages over (potentially multiple times) to ensure that the destination host has the VM's most up-to-date memory state. The hypervisor can track pages dirtied by CPU-initiated writes because it virtualizes the VM's memory page tables in one or more nested, hypervisor-level page tables (referred to as Extended Page Tables, or EPT). Then, during a short, second phase (known as the “switch-over” phase), the original VM on the source host is shut down and the new VM on the destination host is brought up.
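The two-phase workflow above can be sketched as an iterative copy loop. The following Python model is a simplified illustration under stated assumptions, not a description of how any particular hypervisor implements migration: the function pre_copy, its switch_over_threshold and max_rounds parameters, and the guest_workload callback are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Set, Tuple


def pre_copy(
    source: Dict[int, bytes],
    run_vm_and_collect_dirty: Callable[[int], Set[int]],
    switch_over_threshold: int = 2,
    max_rounds: int = 10,
) -> Tuple[Dict[int, bytes], int]:
    """Copy pages repeatedly until few enough are dirty to switch over."""
    destination: Dict[int, bytes] = {}
    to_copy = set(source)  # first round: every page
    rounds = 0
    while True:
        for page in to_copy:
            destination[page] = source[page]
        # The VM keeps running during the copy, so some pages are dirtied again.
        dirty = run_vm_and_collect_dirty(rounds)
        rounds += 1
        if len(dirty) <= switch_over_threshold or rounds >= max_rounds:
            # Switch-over: the VM is stopped and the last dirty pages are copied.
            for page in dirty:
                destination[page] = source[page]
            return destination, rounds
        to_copy = dirty


memory = {page: b"v0" for page in range(8)}


def guest_workload(round_no: int) -> Set[int]:
    # Hypothetical workload that dirties fewer pages in each round.
    dirtied = set(range(max(0, 6 - 2 * round_no)))
    for page in dirtied:
        memory[page] = f"v{round_no + 1}".encode()
    return dirtied


destination, rounds = pre_copy(memory, guest_workload)
```

The loop only converges to a correct copy if the dirty set returned in each round is complete, which is the property that the remainder of this section shows SR-IOV to violate.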
However, when SR-IOV is enabled, the CPU is not the only entity capable of writing data into VM guest memory; as noted above, the VF of an SR-IOV device may also write data into VM guest memory using DMA. The hypervisor cannot track these VF-initiated DMA writes because the EPT is only updated for CPU-initiated memory transactions. As a result, the VM memory pages that are modified by the SR-IOV device via DMA cannot be identified by the hypervisor as “dirty” during the pre-copy phase of the VM migration, and thus cannot be properly copied over to the destination host, thereby breaking the migration process. Similar problems exist when attempting to use SR-IOV in conjunction with other virtualization features that rely on hypervisor-level tracking of dirty VM memory pages, such as snapshots, fault tolerance, etc.
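The tracking gap can be made concrete with a toy model in Python. The GuestMemory class and its dirty set are hypothetical stand-ins for guest memory and for the dirty information the hypervisor derives from the EPT; the sketch assumes only what the text states, namely that CPU writes are recorded and VF-initiated DMA writes are not.

```python
from typing import List, Set


class GuestMemory:
    """Toy guest memory with a dirty set standing in for EPT-based tracking."""

    def __init__(self, num_pages: int) -> None:
        self.pages: List[bytes] = [b"old"] * num_pages
        self.dirty: Set[int] = set()

    def cpu_write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self.dirty.add(page)  # CPU writes are visible to the hypervisor

    def vf_dma_write(self, page: int, data: bytes) -> None:
        self.pages[page] = data  # DMA bypasses the tracking: nothing recorded


def copy_dirty_pages(memory: GuestMemory, destination: List[bytes]) -> None:
    """One pre-copy round: re-send only the pages known to be dirty."""
    for page in sorted(memory.dirty):
        destination[page] = memory.pages[page]
    memory.dirty.clear()


memory = GuestMemory(4)
destination = list(memory.pages)   # initial full copy to the destination host
memory.cpu_write(1, b"cpu")        # tracked
memory.vf_dma_write(3, b"dma")     # not tracked
copy_dirty_pages(memory, destination)

# Pages whose destination copy no longer matches the source.
stale_pages = [p for p in range(4) if destination[p] != memory.pages[p]]
```

In this model page 1 reaches the destination with its new contents, while page 3 remains stale there because nothing ever marked it dirty.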
One known solution for this incompatibility is to modify the VF driver and guest operating system (OS) running within each VM to notify the hypervisor whenever a memory page has been dirtied due to a VF-initiated DMA write. The hypervisor can then mark those pages as dirty in the EPT to facilitate VM migration (or other features). Unfortunately, since this solution effectively requires the VF driver and guest OS to be para-virtualized, it will not work with standard OS/driver distributions. Further, this solution may fail in scenarios where the VM is temporarily suspended or stopped but the VF of the SR-IOV device continues performing DMA writes to the VM's RX buffers. In these scenarios, the code resident within the VM for notifying the hypervisor will also be suspended, and thus the hypervisor will not know (during the period of VM downtime) which memory pages are dirtied by the SR-IOV device.
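The failure mode of the para-virtualized approach can likewise be sketched in Python. All names here (Hypervisor.mark_dirty, ParavirtualizedVm, VirtualFunction.dma_write) are hypothetical and chosen for illustration; the model assumes only that the notification code runs inside the VM and therefore stops when the VM is suspended.

```python
from typing import Set


class Hypervisor:
    def __init__(self) -> None:
        self.dirty_pages: Set[int] = set()

    def mark_dirty(self, page: int) -> None:
        self.dirty_pages.add(page)


class ParavirtualizedVm:
    """Guest whose modified VF driver reports DMA-dirtied pages."""

    def __init__(self, hypervisor: Hypervisor) -> None:
        self.hypervisor = hypervisor
        self.running = True

    def on_rx_complete(self, page: int) -> None:
        # Guest code only executes while the VM is running.
        if self.running:
            self.hypervisor.mark_dirty(page)


class VirtualFunction:
    """The VF keeps performing DMA regardless of the VM's run state."""

    def __init__(self, vm: ParavirtualizedVm) -> None:
        self.vm = vm
        self.pages_written: Set[int] = set()

    def dma_write(self, page: int) -> None:
        self.pages_written.add(page)
        self.vm.on_rx_complete(page)


hypervisor = Hypervisor()
vm = ParavirtualizedVm(hypervisor)
vf = VirtualFunction(vm)

vf.dma_write(0)        # VM running: the hypervisor is notified
vm.running = False     # VM suspended or stopped
vf.dma_write(1)        # still written by the device, but never reported
vf.dma_write(2)

missed = vf.pages_written - hypervisor.dirty_pages
```

Here missed contains pages 1 and 2, i.e., exactly the pages written during the period of VM downtime.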