1. Field of the Invention
This invention relates to computer networks. More particularly, this invention relates to inter-process communication over computer networks.
2. Description of the Related Art
The meanings of certain acronyms and abbreviations used herein are given in Table 1.
TABLE 1
Acronyms and Abbreviations
CPU        Central Processing Unit
HCA        Host Channel Adapter
IB         InfiniBand™
ID         Identifier
IO; I/O    Input and Output
NIC        Network Interface Controller
PSN        Packet Sequence Number
QP         Queue Pair
RC         Reliable Connection
RD         Reliable Datagram
RDC        Reliable Datagram Channel
RDD        Reliable Datagram Domains
RDMA       Remote Direct Memory Access
RTM        Remote Transactional Memory
TCA        Target Channel Adapter
TLP        Transaction Layer Packet
TX-QPN     Transaction QP Table
UC         Unreliable Connection
UD         Unreliable Datagram
WQE        Work Queue Element
WR         Work Request
XRC        Extended Reliable Connected Transport Service
InfiniBand™ (IB) is a switched-fabric communications architecture primarily used in high-performance computing. It has been standardized by the InfiniBand Trade Association. Computing devices (host processors and peripherals) connect to the IB fabric via a network interface controller (NIC), which is referred to in IB parlance as a channel adapter. Host processors (or hosts) use a host channel adapter (HCA), while peripheral devices use a target channel adapter (TCA). IB defines both a layered hardware protocol (physical, link, network, and transport layers) and a software layer, which manages initialization and communication between devices. The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing and transport services, as well as data segmentation when sending and reassembly when receiving.
InfiniBand specifies the following transport services:
Reliable Connection (RC). RC provides reliable transfer of data between two entities, referred to as a requester and a responder. As a connection-oriented transport, RC requires a dedicated queue pair (QP) comprising requester and responder processes.
Unreliable Connection (UC). UC permits transfer of data between two entities. Unlike RC, UC does not guarantee message delivery or ordering. Each pair of connected processes requires a dedicated UC QP.
Reliable Datagram (RD). Using RD enables a QP to send and receive messages from one or more QPs using a reliable datagram channel (RDC) between each pair of reliable datagram domains (RDDs). RD provides most of the features of RC, but does not require a dedicated QP for each process.
Unreliable Datagram (UD). With UD, a QP can send and receive messages to and from one or more remote QPs, but the messages may get lost, and there is no guarantee of ordering or reliability. UD is connectionless, allowing a single QP to communicate with any other peer QP.
Raw Datagram. A raw datagram is a data link layer service, which provides a QP with the ability to send and receive raw datagram messages that are not interpreted.
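The distinctions among the four RC/UC/RD/UD services listed above can be summarized along two axes: whether delivery is reliable, and whether a dedicated QP is needed per pair of peers. The following Python sketch is illustrative only; the names `TransportService`, `SERVICES`, and `needs_dedicated_qp` are hypothetical and are not part of any InfiniBand API. Raw Datagram is omitted because the description above does not characterize it along these axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportService:
    """Illustrative summary of one transport service (hypothetical type)."""
    name: str
    reliable: bool   # are message delivery and ordering guaranteed?
    connected: bool  # is a dedicated QP required per pair of peers?


# Properties taken from the descriptions of RC, UC, RD and UD above.
SERVICES = {
    "RC": TransportService("Reliable Connection", reliable=True, connected=True),
    "UC": TransportService("Unreliable Connection", reliable=False, connected=True),
    "RD": TransportService("Reliable Datagram", reliable=True, connected=False),
    "UD": TransportService("Unreliable Datagram", reliable=False, connected=False),
}


def needs_dedicated_qp(service: str) -> bool:
    """Return True if the service requires a dedicated QP per peer pair."""
    return SERVICES[service].connected
```

For example, `needs_dedicated_qp("RC")` is true while `needs_dedicated_qp("UD")` is false, reflecting that a single UD QP can communicate with any peer QP.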
A recent enhancement to InfiniBand is the Extended Reliable Connected (XRC) transport service (as described, for instance, in “Supplement to InfiniBand™ Architecture Specification Volume 1.2.1, Annex A14: Extended Reliable Connected (XRC) Transport Service”, 2009, Revision 1.0). XRC enables a shared receive queue (SRQ) to be shared among multiple processes running on a given host. As a result, each process can maintain a single send QP to each host rather than to each remote process. A receive QP is established per remote send QP and can be shared among all the processes on the host.
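The saving in send QPs can be illustrated with simple arithmetic. The Python sketch below assumes an all-to-all communication pattern and the same number of processes on every host; both assumptions, and the function names, are introduced here for illustration and do not come from the XRC annex.

```python
def send_qps_per_process(hosts: int, procs_per_host: int, xrc: bool) -> int:
    """Send QPs one process needs to reach every process on every other host.

    Assumes all-to-all communication and equal process counts per host.
    """
    remote_hosts = hosts - 1
    if xrc:
        # XRC: a single send QP per remote host.
        return remote_hosts
    # Without XRC: one QP per remote process.
    return remote_hosts * procs_per_host


def xrc_receive_qps_per_host(hosts: int, procs_per_host: int) -> int:
    """Receive QPs on one host: one per remote send QP aimed at that host.

    These receive QPs are shared among all processes on the host.
    """
    return (hosts - 1) * procs_per_host
```

With 4 hosts and 8 processes per host, a process needs 24 send QPs when each remote process requires its own QP, but only 3 under XRC.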
One mode of operation applicable to IB is disclosed in commonly assigned U.S. Pat. No. 8,761,189, which is herein incorporated by reference. The mode includes allocating, in a NIC, a single dynamically-connected (DC) initiator context for serving requests from an initiator process running on the initiator host to transmit data to multiple target processes running on one or more target nodes. The NIC transmits a first connect packet directed to a first target process and referencing the DC initiator context so as to open a first dynamic connection with the first target process. The NIC receives over the packet network, in response to the first connect packet, a first acknowledgment packet containing a first session identifier (ID). Following receipt of the first acknowledgment packet, the NIC transmits one or more first data packets containing the first session ID over the first dynamic connection from the NIC to the first target process. Dynamic connections with other target processes may subsequently be handled in similar fashion.
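The connect, acknowledge, and data sequence described above can be sketched as a small simulation. This is a toy model under stated assumptions, not the method of the referenced patent: the classes `Packet`, `TargetProcess`, and `DCInitiatorContext` are hypothetical, packets are passed as in-memory objects rather than over a network, the session ID is assumed to be allocated on the target side, and teardown of a dynamic connection is not modeled.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    kind: str                       # "CONNECT", "ACK", or "DATA"
    target: str                     # name of the target process
    session_id: Optional[int] = None
    payload: bytes = b""


class TargetProcess:
    """Toy target: answers CONNECT with an ACK carrying a session ID."""
    _ids = itertools.count(1)       # assumption: IDs are allocated by the target side

    def __init__(self, name: str):
        self.name = name
        self.sessions = {}          # session_id -> list of received payloads

    def handle(self, pkt: Packet) -> Optional[Packet]:
        if pkt.kind == "CONNECT":
            sid = next(TargetProcess._ids)
            self.sessions[sid] = []
            return Packet("ACK", self.name, session_id=sid)
        if pkt.kind == "DATA":
            self.sessions[pkt.session_id].append(pkt.payload)
            return None
        raise ValueError(f"unexpected packet kind: {pkt.kind}")


class DCInitiatorContext:
    """A single initiator context reused for several targets in turn."""

    def __init__(self):
        self.target = None
        self.session_id = None

    def connect(self, target: TargetProcess) -> None:
        # Send the connect packet and record the session ID from the ACK.
        ack = target.handle(Packet("CONNECT", target.name))
        self.target, self.session_id = target, ack.session_id

    def send(self, payload: bytes) -> None:
        # Data packets carry the session ID learned from the ACK.
        self.target.handle(
            Packet("DATA", self.target.name, self.session_id, payload))


def demo():
    ctx = DCInitiatorContext()      # one context serves both targets
    a, b = TargetProcess("A"), TargetProcess("B")
    ctx.connect(a)
    ctx.send(b"hello")
    ctx.connect(b)                  # same context, new dynamic connection
    ctx.send(b"world")
    return a.sessions, b.sessions
```

Calling `demo()` shows one initiator context delivering data first to target A and then to target B, each under the session ID returned in that target's acknowledgment.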
Communication and synchronization among processors are conventionally managed by systems such as message-passing and shared memory. Both of these have significant drawbacks. For example, message-passing involves exchange of specific messages among independent nodes. While useful in cases where the underlying hardware configuration is relatively simple, it requires data structures to be organized and integrated with execution units for particular sets of applications. Shared memory implementations often require extensive cache line tracking mechanisms to avoid unacceptable latencies and to assure memory consistency. The hardware to accomplish this becomes complex.
Another form of communication synchronization is the transactional memory (TM) model. This is a concurrency control mechanism working on shared memory. It allows a group of loads and stores to execute in an atomic manner, i.e., although the code in the group may modify individual variables through a series of assignments, another computation can only observe the program state immediately before or immediately after the group executes.
In the transactional memory model, a transaction may be represented from a programmer's perspective as follows:
Listing 1
Atomic {
    R0 = read [X]
    Optional General Purpose compute
    Write [Y], R1
}
From the user's perspective, the transactional memory model constitutes a major paradigm shift compared to existing locking paradigms for ensuring data consistency. With locking, there is an association between a location in memory (the lock) and a set of memory locations that store data. This method is bug-prone and often introduces contention for accessing the shared locks.
TM does not require an association between data and locks. The user directly specifies the accesses that need to occur atomically to guarantee consistency, and the underlying mechanism achieves this atomicity. In addition, in the event that a specific user thread fails to complete an atomic transaction, the referenced memory locations are not locked and may be accessed by other transactions. During the transaction the memory can be associated with respective intermediate states and a final state, such that when one process performs the accesses responsively to remote I/O operations that are initiated by the network interface controller, the states are concealed from other processes until occurrence of the final state.
In other words, the atomicity semantics of a transaction imply that from a memory-ordering point of view, all operations within a transaction happen in ‘0 time’, so intermediate states are concealed, i.e., they are never exposed to other processes; only the final state is exposed. Such a transaction, executing under control of a network interface controller, follows an all-or-nothing paradigm, where either all side effects of the transaction happen together or not at all. In the event that a transaction fails, e.g., because another thread has attempted to access the same memory location, the memory remains in its pre-transaction state.
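The all-or-nothing behavior can be illustrated with a toy software model. The Python sketch below is a minimal illustration only, assuming an optimistic scheme in which writes are buffered and per-location version counters are validated at commit; it runs entirely in host software and does not represent a transaction executed by a network interface controller. All class and function names are hypothetical.

```python
import threading


class TransactionAborted(Exception):
    """Raised when a conflicting commit invalidated this transaction."""


class ToyTransactionalMemory:
    def __init__(self, initial):
        self._data = dict(initial)
        self._version = {addr: 0 for addr in initial}
        self._commit_lock = threading.Lock()

    def read(self, addr):
        """Non-transactional read of the committed state."""
        return self._data[addr]

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, mem):
        self._mem = mem
        self._read_versions = {}    # addr -> version seen at first read
        self._writes = {}           # buffered writes, invisible until commit

    def read(self, addr):
        if addr in self._writes:
            return self._writes[addr]
        self._read_versions.setdefault(addr, self._mem._version.get(addr, 0))
        return self._mem._data[addr]

    def write(self, addr, value):
        self._writes[addr] = value  # intermediate state stays private

    def commit(self):
        with self._mem._commit_lock:
            # Validate: abort if any location read has changed since.
            for addr, seen in self._read_versions.items():
                if self._mem._version.get(addr, 0) != seen:
                    raise TransactionAborted(addr)
            # Publish all buffered writes together.
            for addr, value in self._writes.items():
                self._mem._data[addr] = value
                self._mem._version[addr] = self._mem._version.get(addr, 0) + 1


def demo():
    mem = ToyTransactionalMemory({"X": 1, "Y": 0})
    t1 = mem.begin()
    r0 = t1.read("X")               # R0 = read [X]
    t1.write("Y", r0 + 10)          # compute, then Write [Y], R1
    concealed = mem.read("Y")       # other observers still see the old Y
    t2 = mem.begin()                # a conflicting transaction commits first
    t2.write("X", 5)
    t2.commit()
    try:
        t1.commit()
        aborted = False
    except TransactionAborted:
        aborted = True
    return concealed, aborted, mem.read("X"), mem.read("Y")
```

In `demo()`, the first transaction follows the shape of Listing 1. Its buffered write to Y is never visible to other readers, and because a conflicting write to X commits first, the first transaction aborts and Y keeps its pre-transaction value.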
An example of a transactional memory computing system is proposed in U.S. Patent Application Publication No. 2009/0113443. A computing system processes memory transactions for parallel processing of multiple threads of execution and provides execution of multiple atomic instruction groups (AIGs) on multiple systems to support a single large transaction that requires operations on multiple threads of execution and/or on multiple systems connected by a network. The support provides a transaction table in memory and fast detection of potential conflicts between multiple transactions. Special instructions may mark the boundaries of a transaction and identify memory locations applicable to a transaction. A ‘private to transaction’ tag, directly addressable as part of the main data storage memory location, enables a quick detection of potential conflicts with other transactions that are concurrently executing on another thread. The tag indicates whether a data entry in memory is part of a speculative memory state of an uncommitted transaction that is currently active in the system.