1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for providing a memory region or memory window access notification on a system area network.
2. Description of Related Art
In a System Area Network (SAN), such as an InfiniBand™ (IB) network or iWarp network, hardware provides a message passing mechanism that can be used for input/output (I/O) devices and interprocess communications (IPC) between general computing nodes. Processes executing on devices access SAN message passing hardware by posting send/receive messages to send/receive work queues on a SAN channel adapter (CA). These send/receive messages are posted as work requests (WR) to the work queues. The processes that post WRs are referred to as “consumer processes” or simply “consumers.”
The send/receive work queues (WQ) are assigned to a consumer process as a queue pair (QP). The messages can be sent over five different transport types: Reliable Connected (RC), Reliable Datagram (RD), Unreliable Connected (UC), Unreliable Datagram (UD), and Raw Datagram (RawD). Consumers retrieve the results of these messages from a completion queue (CQ) through SAN send and receive work completion (WC) queues. The source channel adapter takes care of segmenting outbound messages and sending them to the destination. The destination channel adapter takes care of reassembling inbound messages and placing them in the memory space designated by the destination's consumer.
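The queue model described above can be illustrated with a minimal sketch. It is a toy simulation, not the verbs interface of any real SAN: the names `WorkRequest`, `WorkCompletion`, `QueuePair`, `segment`, `reassemble`, and `process_send_queue`, as well as the `mtu` parameter, are assumptions introduced only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkRequest:
    wr_id: int        # consumer-chosen identifier (hypothetical field)
    payload: bytes    # message to be sent


@dataclass
class WorkCompletion:
    wr_id: int
    status: str


@dataclass
class QueuePair:
    """Toy queue pair: one send work queue and one receive work queue."""
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)


def segment(message: bytes, mtu: int) -> List[bytes]:
    """Source adapter role: split a message into packets of at most mtu bytes."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def reassemble(packets: List[bytes]) -> bytes:
    """Destination adapter role: rebuild the message from in-order packets."""
    return b"".join(packets)


def process_send_queue(qp: QueuePair, cq: deque, mtu: int = 4) -> List[bytes]:
    """Drain the send queue, move each message, and post a work completion."""
    delivered = []
    while qp.send_queue:
        wr = qp.send_queue.popleft()
        packets = segment(wr.payload, mtu)
        delivered.append(reassemble(packets))
        cq.append(WorkCompletion(wr.wr_id, "success"))
    return delivered
```

In this sketch a consumer would append a `WorkRequest` to `qp.send_queue`, call `process_send_queue`, and then take the matching `WorkCompletion` from the completion queue.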
Two channel adapter types are present in nodes of the SAN fabric, a host channel adapter (HCA) and a target channel adapter (TCA). The HCA is used by general purpose computing nodes to access the SAN fabric. Consumers use SAN “verbs” to access host channel adapter functions. The software that interprets verbs and directly accesses the channel adapter is known as the channel interface (CI).
Target channel adapters (TCA) are used by nodes that are the subject of messages sent from host channel adapters. The TCAs serve a function similar to that of the HCAs in providing the target node an access point to the SAN fabric.
The SAN described above uses the registration of memory regions (MRs) or memory windows (MWs) to make memory accessible to HCA hardware. Using the verbs defined within the SAN specification, these MRs or MWs must be pinned, i.e. they must remain constant and not be paged out to disk, while the HCA is allowed to access them. When the MR or MW is pinned, it may not be used by any other application, even if the MR or MW is not being used by the application that owns it. The MR or MW may be thought of as a portion of memory that the consumer will want the channel adapter (HCA or TCA) to use in a future transfer.
The SAN verb interface used with InfiniBand™ and iWarp networks provides remote direct memory access (RDMA) communications between host nodes. RDMA allows data to move directly from the memory of one host node into that of another host node without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.
Today, a consumer process (e.g., an application layer middleware) utilizing RDMA verbs of the SAN verb interface creates a memory region (MR) or memory window (MW) on the host node and utilizes this MR or MW to transfer data via RDMA communications. In an RDMA fabric such as this, data is transmitted in packets that are taken together to form a message.
A typical RDMA transaction oriented protocol exchange between a first host node and a second host node in a SAN can be exemplified in the following manner with reference to FIG. 14. A first upper layer protocol (ULP) consumer 1410, i.e. a consumer utilizing an abstract protocol when performing encapsulation of data communications, of a first host node 1405 uses a Send work request (WR) 1415, which is posted to a send/receive queue pair (QP) 1418 associated with the first ULP consumer 1410, to request an RDMA enabled host channel adapter 1420 to perform a ULP Request RDMA data transfer operation from the second host node 1430. The ULP Request in the Send WR 1415 specifies a pre-allocated MR or MW 1440 to be used with the RDMA data transfer. The second host node's ULP consumer 1450 digests the first ULP Request, i.e. generates a hash fingerprint of the first ULP Request, and performs an RDMA Write of the requested data to the pre-allocated MR or MW 1440 by posting an appropriate RDMA Write WR in its own send/receive queue pair (QP) 1458, which is transformed into an RDMA data transfer to the first host node 1405.
After the second host node's ULP consumer 1450 sends the requested data to the first ULP consumer 1410 via an RDMA Write data transfer, the second host node's ULP consumer 1450 sends a ULP Response message to the first ULP consumer 1410 in the form of a Work Completion (WC) element 1460 that is posted to a completion queue (CQ) 1470 associated with the first ULP consumer 1410. The first ULP consumer 1410 then retrieves the WC element 1460 that contains the ULP Response message from the corresponding CQ 1470. After receiving the ULP Response message, the first ULP consumer 1410 performs a check of the data transferred by the RDMA Write to make sure that it has the correct signature.
Thus, in order for the first ULP consumer 1410 to know that the requested data transfer has been performed, the target of the Send WR posted by the first host node 1405, i.e. the second ULP consumer 1450, must respond with a WC element 1460 that informs the first ULP consumer 1410 that the requested data is ready for the first ULP consumer 1410 to process. The first ULP consumer 1410 must then retrieve the WC element 1460 from its CQ 1470, process it, and then verify that the data that was transferred is correct.
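The exchange described with reference to FIG. 14 can be sketched as a small simulation. This is an illustration under stated assumptions only: the `HostNode` class, the dictionary layout of the request and response, and the use of a CRC32 value as the "signature" are hypothetical and are not taken from the InfiniBand™ or iWarp specifications.

```python
import zlib
from collections import deque


class HostNode:
    """Toy host node with registered memory regions and a completion queue."""

    def __init__(self, name):
        self.name = name
        self.memory_regions = {}          # key -> bytearray (pre-allocated MR/MW)
        self.completion_queue = deque()   # holds WC elements for the consumer

    def register_mr(self, key, size):
        self.memory_regions[key] = bytearray(size)


def rdma_write(target, mr_key, data):
    """Place data directly into the target's pre-allocated memory region."""
    mr = target.memory_regions[mr_key]
    if len(data) > len(mr):
        raise ValueError("data does not fit in the memory region")
    mr[:len(data)] = data


def serve_request(requester, request, store):
    """Second consumer: write the requested data, then post a ULP Response."""
    data = store[request["object"]]
    rdma_write(requester, request["mr_key"], data)
    requester.completion_queue.append({
        "type": "ulp_response",
        "mr_key": request["mr_key"],
        "length": len(data),
        "crc32": zlib.crc32(data),   # assumed stand-in for the signature
    })


def consume_response(requester):
    """First consumer: retrieve the WC element, then verify the written data."""
    wc = requester.completion_queue.popleft()
    mr = requester.memory_regions[wc["mr_key"]]
    data = bytes(mr[:wc["length"]])
    if zlib.crc32(data) != wc["crc32"]:
        raise ValueError("transferred data failed the signature check")
    return data
```

The sketch makes the cost of the two-sided model visible: the first consumer cannot learn that the memory region was written until the response element has been posted, retrieved, and processed.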
The above description is directed to known SAN systems such as InfiniBand™ and iWarp. More information about InfiniBand™ may be obtained from the specification documents available from the InfiniBand™ Trade Association at www.infinibandta.org/specs/. More information about iWarp may be found at the IETF Remote Direct Data Placement Working Group home page at http://www1.ietf.org/html.charters/rddp-charter.html.
The Portals application program interface (API) utilizes a one-sided data movement model rather than a two-sided data movement model such as is used in the SAN environment described above (see, for example, http://www.cs.sandia.gov/Portals/portals-info/portals-index.html). The Portals API uses an address triplet plus a set of match bits to perform data transfers between an initiator and a target. This address triplet comprises a process id, a portal id, and an offset. The process id identifies a target process, the portal id specifies a memory buffer or region of memory, i.e. a portal, to be used for the data transfer operation, and the offset specifies an offset within the memory buffer or region of memory.
Specifically, the process id is used to route a message or data transfer to an appropriate node (not depicted). As shown in FIG. 15, the portal id is used as an index into a portal table 1510 of the node. Each element of the portal table 1510 identifies a match list 1520. Each element of the match list specifies two bit patterns: a set of “don't care” bits, and a set of “must match” bits. In addition to the two sets of match bits, each match list element has at most one memory descriptor 1530. Each memory descriptor identifies a memory region, or portal, 1540 and an optional event queue 1550. The memory region or portal 1540 is the portion of memory to be used in the data transfer operation and the event queue 1550 is used to record information about the data transfer operation. The offset is used as an offset into the memory region or portal 1540.
When translating a Portal address, the match criteria of the first element in the match list 1520 corresponding to the portal table 1510 entry associated with the portal id are checked. If the match criteria specified in the match list entry are met, and the memory descriptor 1530 accepts the operation (e.g., a put, get, or swap operation), the operation is performed using the memory region 1540 specified in the memory descriptor. If the memory descriptor specifies that it is to be unlinked when a threshold has been exceeded, the match list entry is removed from the match list 1520 and the resources associated with the memory descriptor 1530 and match list entry are reclaimed. If there is an event queue 1550 specified in the memory descriptor 1530 and the memory descriptor 1530 accepts the event, the operation is logged in the event queue 1550.
If the match criteria specified in the match list entry are not met, or there is no memory descriptor 1530 associated with the match list entry, or the memory descriptor 1530 associated with the match list entry rejects the operation, the address translation continues with the next match list entry in the match list 1520. If the end of the match list 1520 has been reached, the address translation is aborted and the incoming request is discarded.
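The address translation walk described above can be sketched as follows. This is a simplified illustration, not the Portals API itself: the class and function names are hypothetical, only the put operation is modeled, the threshold is assumed to be a count of remaining operations, and every performed operation is logged whenever an event queue is present.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class MemoryDescriptor:
    region: bytearray                   # memory region, or portal
    allowed_ops: Set[str]               # operations the descriptor accepts
    threshold: Optional[int] = None     # assumed: remaining operations
    unlink_on_threshold: bool = False
    event_queue: Optional[list] = None  # optional log of operations

    def accepts(self, op: str, offset: int, length: int) -> bool:
        if op not in self.allowed_ops:
            return False
        if self.threshold is not None and self.threshold <= 0:
            return False
        return offset >= 0 and offset + length <= len(self.region)


@dataclass
class MatchEntry:
    must_match: int                     # "must match" bit pattern
    dont_care: int                      # "don't care" bit mask
    md: Optional[MemoryDescriptor] = None

    def matches(self, bits: int) -> bool:
        # Bits outside the don't-care mask must equal the must-match bits.
        return ((bits ^ self.must_match) & ~self.dont_care) == 0


def translate_put(portal_table: Dict[int, List[MatchEntry]], portal_id: int,
                  match_bits: int, offset: int, data: bytes) -> bool:
    """Return True if the put was performed, False if it was discarded."""
    match_list = portal_table.get(portal_id, [])
    for index, entry in enumerate(match_list):
        md = entry.md
        if (not entry.matches(match_bits) or md is None
                or not md.accepts("put", offset, len(data))):
            continue                    # try the next match list entry
        md.region[offset:offset + len(data)] = data
        if md.event_queue is not None:
            md.event_queue.append(("put", offset, len(data)))
        if md.threshold is not None:
            md.threshold -= 1
            if md.threshold == 0 and md.unlink_on_threshold:
                del match_list[index]   # unlink the entry and its descriptor
        return True
    return False                        # end of list reached: discard request
```

In this sketch, the event is recorded as a side effect of the access itself, so the target needs no separate response message to learn that the memory region was written.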
Thus, with the Portals API, an event may be automatically generated upon the accessing of a previously registered memory region, or portal. Such functionality is not currently present in the known SAN architectures. Therefore, it would be beneficial to have an improved SAN architecture that provides the infrastructure for the generation and tracking of events upon access of a previously registered memory region.