1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to an apparatus and method for splitting endpoint address translation cache management responsibilities between a device driver and device driver services.
2. Description of Related Art
On some systems that use current Peripheral Component Interconnect (PCI) protocols, address translation and access checks for direct memory access (DMA) operations are performed using an address translation and protection table (ATPT). Though ATPTs have been in use for several decades, they are new to lower-end systems and are known by several other names, such as a Direct Memory Access (DMA) Remapping Resource or an Input/Output Memory Management Unit (IOMMU). The ATPT stores entries for translating PCI bus addresses, utilized in DMA transactions, to real memory addresses used to access the real memory resources. The entries in the ATPT also store protection information identifying which devices may access corresponding portions of memory and the particular operations that such devices may perform on these portions of memory.
Recently, the PCI-SIG has been in the process of standardizing mechanisms that allow the address translations resident in an ATPT to be cached in a PCI family adapter. These mechanisms are known as Address Translation Services (ATS). ATS allows a PCI family adapter to request a translation for an untranslated PCI bus address, where a successful completion of such a request on a system that supports ATS returns the translated address, i.e., the real memory address, to the PCI family adapter. ATS allows a PCI family adapter to then mark PCI bus addresses used in DMA operations as translated. A system that supports ATS will then use the translated addresses to bypass the ATPT. ATS also provides a mechanism by which the host side (e.g., hardware or a virtualization intermediary) can invalidate a previously advertised address translation.
FIG. 1 is an exemplary diagram illustrating a conventional mechanism for performing DMA operations using an ATPT and the PCI Express (PCIe) communication protocol. The depicted example also shows the PCIe address translation service (ATS) described above, which is invoked by PCIe endpoints, e.g., PCIe input/output (I/O) adapters that use ATS to perform address translation operations. ATS functionality is built into the PCIe endpoints and the root complex of the host system, as discussed hereafter. For more information regarding PCIe ATS, reference is made to the PCIe ATS specification available from the Peripheral Component Interconnect Special Interest Group (PCI-SIG) website.
As shown in FIG. 1, the host CPUs and memory 110 are coupled by way of a system bus 115 to a PCIe root complex 120 that contains the address translation and protection tables (ATPT) 130. The PCIe root complex 120 is in turn coupled to one or more PCIe endpoints 140 (the term “endpoint” is used in the PCIe specification to refer to PCIe enabled I/O adapters) via PCIe link 135. The root complex 120 denotes the root of an I/O hierarchy that connects the CPU/memory to the PCIe endpoints 140. The root complex 120 includes a host bridge, zero or more root complex integrated endpoints, zero or more root complex event collectors, and one or more root ports. Each root port supports a separate I/O hierarchy. Each I/O hierarchy may comprise the root complex 120, zero or more interconnect switches and/or bridges (which comprise a switch or PCI fabric), and one or more endpoints, such as endpoint 140. For example, PCIe switches may be used to increase the number of PCIe endpoints, such as endpoint 140, attached to the root complex 120. For more information regarding PCI and PCIe, reference is made to the PCI and PCIe specifications available from the PCI-SIG website.
The PCIe endpoint includes internal routing circuitry 142, configuration management logic 144, one or more physical functions (PFs) 146 and zero or more virtual functions (VFs) 148-152, where each VF is associated with a PF. ATS permits each virtual function to make use of an address translation cache (ATC) 160-164 for caching PCI memory addresses that have already been translated and can be used by the virtual function to bypass the host ATPT 130 when performing DMA operations.
In operation, the PCIe endpoint 140 may invoke PCIe ATS transactions to request a translation of a given PCI bus address into a system bus address and indicate that a subsequent transaction, e.g., a DMA operation, has been translated and can bypass the ATPT. The root complex 120 may invoke PCIe ATS transactions to invalidate a translation that was provided to the PCIe endpoint 140 so that the translation is no longer used by the physical and/or virtual function(s) of the PCIe endpoint 140.
For example, when a DMA operation is to be performed, the address of the DMA operation may be looked up in the ATC 160-164 of the particular virtual function 148-152 handling the DMA operation. If an address translation is not present in the ATC 160-164, then a translation request may be made by the PCIe endpoint 140 to the root complex 120. The root complex 120 may then perform address translation using the ATPT 130 and return the translated address to the PCIe endpoint 140. The PCIe endpoint 140 may then store the translation in the appropriate ATC 160-164 corresponding to the physical and/or virtual function that is handling the DMA operation. The DMA operation may then be passed onto the system bus 115 using the translated address.
If a translation for this address is already present in the ATC 160-164, then the translated address is used with the DMA operation. A bit may be set in the DMA header to indicate that the address is already translated and that the ATPT 130 in the root complex 120 may be bypassed for this DMA. As a result, the DMA operation is performed directly between the PCIe endpoint 140 and the host CPUs and memory 110 via the PCIe link 135 and system bus 115. Access checks may still be performed by the root complex 120 to ensure that the particular Bus/Device/Function (BDF) number of the virtual function of the PCIe endpoint corresponds to a BDF that is permitted to access the address in the manner requested by the DMA operation.
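By way of illustration only, the hit and miss paths described above may be sketched as follows. This is a minimal model and not an implementation of the PCIe ATS protocol: the class names `RootComplex` and `VirtualFunction`, the flat dictionary standing in for the ATPT, and the 4 KiB page size are all assumptions made for the example.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12                    # assumed 4 KiB pages (not stated in the text)
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class RootComplex:
    """Hypothetical root complex holding a flat ATPT: bus page -> real page."""

    def __init__(self, atpt):
        self.atpt = dict(atpt)
        self.translation_requests = 0

    def translate(self, bus_addr):
        # Models servicing an ATS translation request from the ATPT.
        self.translation_requests += 1
        real_page = self.atpt[bus_addr >> PAGE_SHIFT]
        return (real_page << PAGE_SHIFT) | (bus_addr & PAGE_MASK)


@dataclass
class VirtualFunction:
    """Hypothetical virtual function with its own address translation cache."""

    root: RootComplex
    atc: dict = field(default_factory=dict)   # bus page -> real page

    def dma(self, bus_addr):
        """Return (real_addr, atc_hit) for a DMA operation targeting bus_addr."""
        page = bus_addr >> PAGE_SHIFT
        hit = page in self.atc
        if not hit:
            # Miss: request a translation from the root complex and cache it.
            real_addr = self.root.translate(bus_addr)
            self.atc[page] = real_addr >> PAGE_SHIFT
        # Hit, or freshly filled entry: the DMA proceeds with a translated address.
        real_addr = (self.atc[page] << PAGE_SHIFT) | (bus_addr & PAGE_MASK)
        return real_addr, hit


rc = RootComplex({0x10: 0x80})
vf = VirtualFunction(rc)
first = vf.dma(0x10ABC)    # miss: one translation request reaches the root complex
second = vf.dma(0x10ABC)   # hit: served from the ATC, ATPT bypassed
```

In this sketch only the first DMA to a given page reaches the root complex for translation; later DMAs to the same page are resolved from the ATC. The access checks that the root complex may still perform on translated requests are not modeled.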
At some time later, if the translation that was provided to the PCIe endpoint 140 is no longer to be used by the PCIe endpoint 140, such as when a translation has changed within the ATPT 130, the root complex 120 must issue an ATS invalidation request to the PCIe endpoint 140. The PCIe endpoint 140 does not immediately flush all pending requests directed to the invalid address. Rather, the PCIe endpoint 140 waits for all outstanding read requests that reference the invalid translated address to retire and then releases the translation in the ATC 160-164, such as by setting a bit to mark the entry in the ATC 160-164 as invalid. The PCIe endpoint 140 then returns an ATS invalidation completion message to the root complex 120 indicating that the translation in the ATC 160-164 has been invalidated. The PCIe endpoint 140 ensures that the invalidation completion indication arrives at the root complex 120 after any previously posted writes that use the invalidated address.
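The invalidation sequence above may be illustrated with the following sketch, which is offered as an example under stated assumptions rather than as the ATS-defined behavior. The names `AtcEntry` and `EndpointAtc` are hypothetical, the refusal of new reads while an invalidation is pending is an assumption of the example, and the ordering of the completion behind posted writes is not modeled.

```python
class AtcEntry:
    """Hypothetical ATC entry with a valid bit and a count of in-flight reads."""

    def __init__(self, real_page):
        self.real_page = real_page
        self.valid = True
        self.outstanding_reads = 0


class EndpointAtc:
    """Hypothetical endpoint-side ATC that defers invalidation completions."""

    def __init__(self):
        self.entries = {}
        self.pending_invalidations = set()
        self.completions = []   # invalidation completions sent to the root complex

    def install(self, bus_page, real_page):
        self.entries[bus_page] = AtcEntry(real_page)

    def start_read(self, bus_page):
        entry = self.entries[bus_page]
        # Assumption: no new reads are issued once invalidation has been requested.
        if not entry.valid or bus_page in self.pending_invalidations:
            raise LookupError("translation is not usable")
        entry.outstanding_reads += 1

    def retire_read(self, bus_page):
        self.entries[bus_page].outstanding_reads -= 1
        self._try_complete(bus_page)

    def invalidate(self, bus_page):
        # Invalidation request received: do not flush, just record it.
        self.pending_invalidations.add(bus_page)
        self._try_complete(bus_page)

    def _try_complete(self, bus_page):
        entry = self.entries[bus_page]
        if bus_page in self.pending_invalidations and entry.outstanding_reads == 0:
            entry.valid = False                    # mark the entry invalid
            self.pending_invalidations.discard(bus_page)
            self.completions.append(bus_page)      # report completion


atc = EndpointAtc()
atc.install(0x10, 0x80)
atc.start_read(0x10)
atc.invalidate(0x10)                       # a read is still in flight
completed_early = list(atc.completions)    # nothing reported yet
atc.retire_read(0x10)                      # last read retires, completion is sent
```

The point of the sketch is the ordering: the entry is marked invalid and the completion is reported only after the count of outstanding reads for that translation reaches zero.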
Typically, the ATPT 130 may be provided as tree-structured translation tables in system memory. A different tree-structure may be provided for each PCI Bus/Device/Function (BDF) of the computing system. Using these ATPT data structures, devices may share a device address space and devices may have dedicated address spaces. Thus, not all devices may perform all DMA operations on all address spaces of the system memory.
The accessing of the ATPT 130 is done synchronously as part of the DMA transaction. This involves utilizing a time consuming translation mechanism for: translating the untranslated PCI bus memory addresses of the DMA transactions to translated real memory addresses used to access the host's memory; and checking the ATPT to ensure that the device submitting the DMA transaction has sufficient permissions for accessing the translated real memory addresses and has sufficient permissions to perform the desired DMA operation on the translated real memory addresses.
As part of accessing the ATPT 130, the correct ATPT tree data structure corresponding to a particular BDF must be identified and the tree data structure must be walked in order to perform the translation and access checking. Locating the ATPT tree data structure associated with the BDF may require one or two memory accesses. Once the tree data structure is found, walking it may take three or four further accesses. Because each DMA translation may therefore require from four to six memory accesses, this translation and access checking is responsible for the large latencies associated with DMA operations. These latencies may cause serious issues with endpoints that require low communication latency.
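The cost of such a walk can be made concrete with a short sketch that counts table reads. The layout used here is assumed purely for illustration: a single lookup locates the tree for a BDF, the tree is three levels deep with 9-bit indices, pages are 4 KiB, and each leaf holds a real page number and a write-permission flag. None of these parameters, nor the names `CountingMemory`, `build_tree`, and `translate`, come from the PCI or PCIe specifications.

```python
LEVELS = 3            # assumed tree depth
BITS_PER_LEVEL = 9    # assumed index width per level
PAGE_SHIFT = 12       # assumed 4 KiB pages
INDEX_MASK = (1 << BITS_PER_LEVEL) - 1


class CountingMemory:
    """Stands in for system memory and counts each table read."""

    def __init__(self):
        self.reads = 0

    def read(self, table, index):
        self.reads += 1
        return table[index]


def build_tree(bus_page, leaf):
    """Build a sparse LEVELS-deep tree that maps one bus page to a leaf entry."""
    node = leaf
    for level in range(LEVELS):   # wrap from the lowest level up to the top
        index = (bus_page >> (BITS_PER_LEVEL * level)) & INDEX_MASK
        node = {index: node}
    return node


def translate(mem, bdf_table, bdf, bus_addr, want_write):
    """Locate the tree for this BDF, walk it, and check permissions."""
    node = mem.read(bdf_table, bdf)            # one access to find the tree
    bus_page = bus_addr >> PAGE_SHIFT
    for level in reversed(range(LEVELS)):      # one access per tree level
        index = (bus_page >> (BITS_PER_LEVEL * level)) & INDEX_MASK
        node = mem.read(node, index)
    real_page, writable = node
    if want_write and not writable:
        raise PermissionError("BDF may not write this page")
    return (real_page << PAGE_SHIFT) | (bus_addr & ((1 << PAGE_SHIFT) - 1))


mem = CountingMemory()
bdf = 0x0100                                   # hypothetical bus 1, device 0, function 0
bdf_table = {bdf: build_tree(0x12345, (0x80, False))}
real_addr = translate(mem, bdf_table, bdf, 0x12345678, want_write=False)
reads_per_translation = mem.reads
```

Under these assumptions a single read-only translation costs four dependent memory reads, each of which must finish before the next can begin; this serialization is the latency that caching completed translations is intended to avoid.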
As a way of mitigating these latencies, the ATS implemented in the PCIe endpoint 140 utilizes the ATCs 160-164 to store already performed address translations so that these translations need not be performed again. Thus, through a combination of the ATPT and the ATCs, the PCIe ATS performs address translations and access checks in such a manner as to reduce the latency associated with DMA operations. While the PCI-SIG has set forth a specification for the PCIe ATS, the PCI-SIG has not specified how the responsibilities for performing address translation using ATS and managing ATS structures, such as the ATPT and ATCs, are to be apportioned in a system implementing the PCIe ATS.