In many computer systems, peripheral devices communicate with the central processing unit (CPU) and with one another over a peripheral component bus, such as the PCI-Express® (PCIe®) bus. Such peripheral devices may include, for example, a solid state drive (SSD), a network interface controller (NIC), and various accelerator modules, such as a graphics processing unit (GPU).
Methods for directly accessing the local memory of a peripheral device via PCIe and other peripheral component buses are known in the art. For example, U.S. Patent Application Publication 2015/0347349, which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference, describes a method for communicating between at least first and second devices over a bus in accordance with a bus address space, including providing direct access over the bus to a local address space of the first device by mapping at least some of the addresses of the local address space to the bus address space. The term “direct access” means that data can be transferred between devices, over the bus, with no involvement of the software running on the CPU in the data plane.
As another example, GPUDirect RDMA is an application program interface (API), distributed by Mellanox Technologies Ltd. (Yokneam, Israel), that supports interaction between an InfiniBand™ NIC (referred to as a host channel adapter, or HCA) and peer memory clients, such as GPUs. This API provides a direct P2P (peer-to-peer) data path between the GPU memory and Mellanox HCA devices. It enables the HCA to read and write peer memory data buffers, and thus allows RDMA-based applications to use the computing power of the peer device without the need to copy data to host memory.
Transactions on the PCIe bus fall into two general classes: posted and non-posted, as defined in section 2.4.1 (pages 122-123) of the PCI Express Base Specification (Rev. 3.0, referred to hereinbelow simply as the “PCIe specification”). In non-posted transactions, the device that initiates the transaction (referred to as the “requester”) expects to receive a completion Transaction Layer Packet (TLP) from the device completing the request (the “completer”), confirming that the request was received and serviced. Read requests are an example of non-posted transactions. In posted transactions, the requester does not expect to and will not receive a completion TLP. Write requests are an example of posted transactions, and thus the requester will generally not know when or even whether the write transaction was successfully completed.
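The posted/non-posted distinction described above is encoded in the Fmt and Type fields of the TLP header. The following sketch, which is illustrative only and not part of any claimed apparatus, classifies a few common TLP types by these fields, using the encodings given in the PCIe specification (memory reads are non-posted, memory writes and messages are posted, and completions form their own class); extraction of the fields from raw header bytes is omitted:

```c
#include <stdint.h>

/* Illustrative classification of common TLP types by the 3-bit Fmt and
 * 5-bit Type header fields, per the encodings in the PCIe specification. */
typedef enum { TLP_POSTED, TLP_NON_POSTED, TLP_COMPLETION } tlp_class_t;

tlp_class_t classify_tlp(uint8_t fmt, uint8_t type)
{
    if (type == 0x00)                        /* MRd/MWr share Type 0 0000   */
        return (fmt & 0x2) ? TLP_POSTED      /* Fmt 010/011: write, has data */
                           : TLP_NON_POSTED; /* Fmt 000/001: read request   */
    if (type == 0x0A)                        /* Cpl/CplD, Type 0 1010       */
        return TLP_COMPLETION;
    if ((type & 0x18) == 0x10)               /* Msg*, Type 1 0rrr: posted   */
        return TLP_POSTED;
    return TLP_NON_POSTED;                   /* e.g. CfgRd/CfgWr, IORd/IOWr */
}
```

A read request (Fmt 000, Type 0 0000) thus classifies as non-posted, while a write request with data (Fmt 010, Type 0 0000) classifies as posted.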
Because PCIe transactions rely on transmission and reception of packets over a bus fabric, two transactions directed to the same device may reach that device in an order different from the order in which they were requested. The PCIe specification imposes certain rules on the ordering of transmission of TLPs by switches on the bus, for example that non-posted transactions (such as read requests) must not pass posted transactions (such as write requests). On the other hand, some PCIe devices and applications use “relaxed ordering” for enhanced performance. When the relaxed ordering attribute bit is set in a TLP (as defined in section 2.2.6.4 of the PCIe specification, page 75), switches on the PCIe bus are not required to observe strong write ordering with respect to this TLP, and write transactions can thus be forwarded and executed out of order. Relaxed ordering allows the host bridge to transfer data to and from memory more efficiently and may result in better direct memory access (DMA) performance.
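The relaxed ordering attribute mentioned above occupies bit Attr[1] of the TLP header, carried in bit 5 of the header's third byte alongside TD, EP, AT, and the upper Length bits. The following sketch, again purely illustrative, tests and sets that bit in a raw header buffer; in practice, drivers request relaxed ordering through a device's DMA descriptors or the Enable Relaxed Ordering bit of its Device Control register rather than by editing headers directly:

```c
#include <stdbool.h>
#include <stdint.h>

/* Attr[1] (Relaxed Ordering) sits at bit 5 of TLP header byte 2, whose
 * layout per the PCIe specification is: TD, EP, Attr[1:0], AT, Length[9:8]. */
#define TLP_ATTR_RO  (1u << 5)

/* Return true if the relaxed ordering attribute bit is set in the header. */
bool tlp_relaxed_ordering(const uint8_t hdr[])
{
    return (hdr[2] & TLP_ATTR_RO) != 0;
}

/* Set the relaxed ordering attribute bit, permitting switches to reorder
 * this TLP relative to earlier posted writes. */
void tlp_set_relaxed_ordering(uint8_t hdr[])
{
    hdr[2] |= TLP_ATTR_RO;
}
```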