The present invention relates to computer systems and remote direct memory access (RDMA), and more specifically, to methods and systems for direct sending of application data via a combination of synchronous and asynchronous processing.
RDMA device and application programming specifications state that posting work requests and dequeueing work completions should be “fast-path operations”, which indicates that the corresponding function calls of a software implementation should be non-blocking. While “non-blocking” is not a precise characterization of a function, it generally means that the function must not sleep. This characterization implies that the function must not wait for (i) a used (or locked) resource to be freed (unlocked) by another thread, or (ii) a remote event, i.e., an event caused by a remote entity such as the transport layer peer or the network. A “non-blocking” operation may nevertheless perform a lengthy calculation as long as its execution time is approximately known, reasonably bounded and deterministic. Conversely, a “blocking” operation is one that may sleep.
An RDMA work request (WR) representing a data transfer operation provides a description of an application data buffer to be sent or received. For an RDMA device, posting a WR typically queues the WR to a FIFO send queue (SQ) or receive queue (RQ). For example, an RDMAP Send or RDMA Write WR may be posted to an SQ. Similarly, reaping a work completion dequeues a work completion from a completion queue (CQ). As stated above, these operations must be non-blocking.
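The posting and reaping operations described above can be illustrated with a minimal sketch. It is a toy model, not an RDMA device interface: the names `WorkRequest`, `WorkCompletion`, `QueuePair`, `post_send`, `poll_cq` and `process_one` are assumptions chosen for illustration. The point is only that posting and polling return immediately, whether or not they succeed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkRequest:
    wr_id: int      # caller-chosen identifier, echoed in the completion
    opcode: str     # e.g. "SEND" or "RDMA_WRITE"
    buffer: bytes   # stands in for the described application buffer


@dataclass
class WorkCompletion:
    wr_id: int
    status: str


class QueuePair:
    """Toy model: a bounded FIFO send queue plus a completion queue."""

    def __init__(self, sq_depth: int) -> None:
        self.sq_depth = sq_depth
        self.sq: deque = deque()
        self.cq: deque = deque()

    def post_send(self, wr: WorkRequest) -> bool:
        # Fast path: enqueue or fail immediately; never wait for space.
        if len(self.sq) >= self.sq_depth:
            return False
        self.sq.append(wr)
        return True

    def poll_cq(self) -> Optional[WorkCompletion]:
        # Fast path: return a completion if one is ready, else None.
        return self.cq.popleft() if self.cq else None

    def process_one(self) -> None:
        # Stand-in for the (possibly blocking) processing done elsewhere.
        if self.sq:
            wr = self.sq.popleft()
            self.cq.append(WorkCompletion(wr.wr_id, "SUCCESS"))
```

In this sketch a full send queue is reported to the caller as a failed post rather than by waiting, and an empty completion queue yields `None`.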
However, the processing of an entire SQ WR such as an RDMA Write operation, including the eventual generation of a work completion (WC), is blocking as defined above because the processing may need to wait for a remote event such as the opening of the TCP congestion window or the peer's TCP receive window. If the Internet Wide Area RDMA Protocol (iWARP) RDMA transport is used and the iWARP protocols are implemented in software by using TCP sockets, then the transmission of an RDMAP message and associated RDMA frames involves the use of socket send, sendmsg or similar operations. In this case, remote events such as network congestion or lack of receive buffers can manifest locally as a closed TCP congestion window, a closed peer TCP receive window, or a lack of write or send space, resulting in a blocking socket send or sendmsg system call. Another example of a blocking operation is the processing of an RDMA Read SQ WR, which needs to wait for the RDMA Read Response from the remote RDMA device after sending an RDMA Read Request. Consequently, attempting to directly and synchronously process an entire SQ WR while posting the WR may block the application process. Analogous restrictions apply to RQ WR processing.
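The lack of send space mentioned above can be observed with a short sketch. It assumes a local socket pair whose receiving end is never read, standing in for a TCP connection with a closed peer receive window; the helper name `fill_send_space` is hypothetical. The socket is put into non-blocking mode so that the point at which a blocking send would have slept shows up as a `BlockingIOError` instead.

```python
import socket


def fill_send_space(sock: socket.socket, chunk: bytes) -> int:
    """Send until the kernel reports no send space; return bytes accepted."""
    sock.setblocking(False)
    accepted = 0
    while True:
        try:
            accepted += sock.send(chunk)
        except BlockingIOError:
            # A blocking socket would sleep here until the peer drains data.
            return accepted


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    try:
        n = fill_send_space(sender, b"x" * 4096)
        print(f"accepted {n} bytes before send would have blocked")
    finally:
        sender.close()
        receiver.close()
```

The number of bytes accepted depends on the platform's socket buffer sizes, so the sketch reports it rather than assuming a value.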
Transmission of RDMA frames is also needed for handling inbound RDMA Read Requests, which are queued on a local Inbound RDMA Read Queue (IRRQ). In a software implementation, the transmission of the associated RDMA Read Response is blocking in the above sense because it may need to wait for a remotely triggered event as described above for the processing of an SQ WR.
To ensure that posting a WR is non-blocking, a known solution is to process the RDMA operation described by the WR asynchronously. For an RDMA software implementation in a multi-tasking OS environment, such asynchronous processing can occur through a separate task or thread, be it in user space or in the OS kernel. However, delegating work to another task results in additional overhead as described below.
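The division of labor in this known solution can be sketched as follows, assuming a user-space thread and Python's standard `queue` module; `start_worker` and `post_send` are hypothetical names. Note that `queue.Queue` takes a short internal lock, so the sketch illustrates only the hand-off to a separate thread, not a strictly non-blocking fast path.

```python
import queue
import threading


def start_worker(sq: "queue.Queue", cq: "queue.Queue") -> threading.Thread:
    """Start a thread that drains the send queue and emits completions."""

    def run() -> None:
        while True:
            wr_id = sq.get()      # the worker, not the poster, may sleep
            if wr_id is None:     # sentinel: shut down
                return
            # Placeholder for the possibly blocking transmission.
            cq.put((wr_id, "SUCCESS"))

    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t


def post_send(sq: "queue.Queue", wr_id: int) -> bool:
    """Hand the request to the worker, or fail at once if the queue is full."""
    try:
        sq.put_nowait(wr_id)
        return True
    except queue.Full:
        return False
```

Every posted request in this sketch crosses a thread boundary before it is transmitted, which is the source of the additional overhead.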
If a separate task or thread is used for asynchronous processing of RDMA operations and this task or thread should be able to handle multiple connections in a fair and non-blocking fashion, it is not always possible to fully process an RDMA operation, as this operation might block and prevent progress with other connections.
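One way to picture this constraint is a single worker that gives each connection a bounded turn and requeues it after partial progress. The sketch below is an illustration under stated assumptions, not a description of any particular implementation: `serve_round_robin`, the `quantum` parameter and the `try_send` callback (which returns how many bytes were accepted without blocking) are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def serve_round_robin(
    pending: Dict[str, int],
    try_send: Callable[[str, int], int],
    quantum: int,
) -> List[Tuple[str, int]]:
    """Serve connections in turn, sending at most `quantum` bytes per turn.

    pending maps a connection name to the bytes it still has to send.
    try_send(conn, n) returns how many of n bytes were accepted without
    blocking (0..n). Returns the log of (connection, bytes_sent) turns.
    """
    ready: Deque[str] = deque(pending)
    log: List[Tuple[str, int]] = []
    while ready:
        conn = ready.popleft()
        sent = try_send(conn, min(quantum, pending[conn]))
        log.append((conn, sent))
        pending[conn] -= sent
        if sent == 0:
            # Would block: park the connection instead of waiting on it.
            continue
        if pending[conn] > 0:
            # Partial progress: requeue behind the other connections.
            ready.append(conn)
    return log
```

For example, with `pending = {"a": 5, "b": 2}`, a `quantum` of 2 and a `try_send` that accepts everything, connection "b" finishes after its first turn even though "a" was posted first with a longer operation. A parked connection would have to be resumed once its socket becomes writable, which this sketch leaves out.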
For the iWARP RDMA transport, if a separate kernel thread is used for asynchronous transmission (i.e., outside the user process context), then DDP segmentation and transport framing for sending an RDMAP message must access the user's source buffer through its underlying pages, since the buffer is not accessible via user virtual addresses. The pages are known to the iWARP sender through earlier memory registration (including memory pinning) performed by the user. A kernel thread can access these pages after mapping them to kernel virtual addresses. On a 32-bit processor, a kernel thread typically cannot access a user buffer through user virtual addresses due to address space limitations.
In an iWARP software implementation, asynchronously handling transmission presents several problems. By using a separate task or thread for asynchronous transmission in addition to the user process, a much higher context switch rate may result, causing increased CPU utilization and cache disturbance. Compared to synchronous processing in user process context, the code path length may grow.
Using a task or thread per connection is undesirable because such a design would not scale to many connections. When using one task or thread for multiple connections, transmission operations associated with one connection may block operations for other connections. Due to network congestion or a closed TCP receive window, it may not be possible to fully process a given, possibly lengthy RDMA operation without blocking. When using one task or thread for multiple connections, the presence of lengthy operations and/or multiple work requests queued per connection raises fairness issues regarding the use of the data link.
Before a kernel thread can access the user's source buffer through the underlying pages, these pages need to be mapped to kernel virtual addresses. On a 32-bit processor, kernel virtual addresses are a precious resource, and mapping a large number of pages can be problematic. When a kernel thread doing DDP segmentation and transport framing accesses a page of the user's source buffer after mapping it to kernel virtual addresses, L1 data cache performance may be degraded. Since this cache is keyed through virtual addresses, it may be unable to detect that the user and kernel virtual addresses of the source buffer in fact refer to the same physical memory, causing unnecessary L1 data cache misses.
On the other hand, synchronously handling transmission is problematic as this operation may block due to remote or local events.