Conventional network interface devices provide an application programming interface (API) that implements message passing (data communication) between nodes. For example, in MPI (Message Passing Interface) implementations, the API supports basic calls such as Send and Receive. The Receive call permits the calling node to specify which message(s) it is willing to receive, and where the message(s) should be stored upon arrival. Conceptually, an MPI implementation has two message queues (also referred to as lists): a posted receive queue and an unexpected message (or simply unexpected) queue. The posted receive queue contains a list of outstanding receive requests, and the unexpected message queue contains a list of messages that have arrived but did not match any previously posted receive request. An incoming message traverses the posted receive queue in search of a match, and is placed in the unexpected queue if no match is found. Before a receive request can be added to the posted receive queue, the unexpected queue must be searched for a possible match. The search of the unexpected queue and the posting of the receive request must together be performed as an atomic operation, in order to ensure that a matching message does not arrive between the search of the unexpected queue and the posting of the receive request in the posted receive queue.
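The two-queue protocol described above can be pictured with a minimal software sketch. It is illustrative only and is not drawn from any particular MPI implementation: the class name `MatchQueues`, its method names, and the caller-supplied `matches` predicate are all hypothetical, and a single lock stands in for whatever mechanism a real implementation uses to make the search-then-post step atomic.

```python
import threading


class MatchQueues:
    """Conceptual posted-receive and unexpected queues (hypothetical names)."""

    def __init__(self, matches):
        self._matches = matches      # assumed predicate: matches(request, message) -> bool
        self._posted = []            # outstanding receive requests, oldest first
        self._unexpected = []        # arrived but unmatched messages, oldest first
        self._lock = threading.Lock()

    def on_message_arrival(self, message):
        """Return the matched receive request, or None if the message was
        stored in the unexpected queue."""
        with self._lock:
            for i, request in enumerate(self._posted):
                if self._matches(request, message):
                    return self._posted.pop(i)
            self._unexpected.append(message)
            return None

    def post_receive(self, request):
        """Return a matching unexpected message, or None if the request was
        added to the posted receive queue."""
        # Search and post happen under one lock, so no message can arrive
        # between the search of the unexpected queue and the posting.
        with self._lock:
            for i, message in enumerate(self._unexpected):
                if self._matches(request, message):
                    return self._unexpected.pop(i)
            self._posted.append(request)
            return None


# Usage with a deliberately simplified predicate that compares tags only.
queues = MatchQueues(lambda request, message: request == message["tag"])
first = queues.on_message_arrival({"tag": 7, "payload": b"early"})  # nothing posted yet
early = queues.post_receive(7)       # satisfied from the unexpected queue
pending = queues.post_receive(9)     # no match, so the request is posted
late = queues.on_message_arrival({"tag": 9, "payload": b"late"})    # matches request 9
```

Both methods walk their queue from the oldest entry forward, so the cost of each call grows with the length of the queue being searched.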
MPI messages are matched using three fields, namely, a context identifier, a source rank and a message tag. The context identifier represents an MPI communicator object, providing a safe message passing context so that messages from one context do not interfere with messages from other contexts. The source rank represents the local rank of the sending process within the communicator, and the message tag is a user-assigned value that can be used for further message selection within a particular context. A posted receive request must explicitly specify the context identifier, but may “wildcard” the source rank and message tag values so that they match any value. In addition to these matching criteria, MPI also requires that messages between two nodes in the same context must arrive in the order in which the corresponding sends were initiated.
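As a rough illustration of the matching rule, the following sketch compares the three fields and treats a sentinel value as a wildcard. The names `Envelope`, `ReceiveRequest`, `matches` and `first_match`, and the sentinel values chosen for `ANY_SOURCE` and `ANY_TAG`, are assumptions made for this example and are not taken from any MPI library.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY_SOURCE = -1  # hypothetical wildcard sentinel for the source rank
ANY_TAG = -1     # hypothetical wildcard sentinel for the message tag


@dataclass(frozen=True)
class Envelope:
    """Matching fields carried by an incoming message."""
    context: int
    source: int
    tag: int


@dataclass(frozen=True)
class ReceiveRequest:
    """Matching fields of a posted receive; the context is never wildcarded."""
    context: int
    source: int = ANY_SOURCE
    tag: int = ANY_TAG


def matches(request: ReceiveRequest, envelope: Envelope) -> bool:
    """Context must be equal; source and tag match if equal or wildcarded."""
    return (request.context == envelope.context
            and request.source in (ANY_SOURCE, envelope.source)
            and request.tag in (ANY_TAG, envelope.tag))


def first_match(posted: List[ReceiveRequest], envelope: Envelope) -> Optional[int]:
    """Index of the oldest posted request matching the envelope, else None."""
    for index, request in enumerate(posted):
        if matches(request, envelope):
            return index
    return None


posted = [
    ReceiveRequest(context=1, source=3, tag=5),  # fully specified
    ReceiveRequest(context=1),                   # wildcards source and tag
    ReceiveRequest(context=2, tag=5),            # wildcards source only
]
exact = first_match(posted, Envelope(context=1, source=3, tag=5))      # 0
wildcard = first_match(posted, Envelope(context=1, source=4, tag=9))   # 1
no_match = first_match(posted, Envelope(context=2, source=0, tag=6))   # None
```

Selecting the oldest matching request, rather than any matching request, is one way a list-based implementation can respect the ordering requirement when several posted receives could accept the same message.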
In prior art MPI implementations, the posted receive queue and the unexpected queue are represented as linear lists. Accordingly, the time required to traverse the queues increases linearly with the lengths of the lists. In applications that include parallel processes, the lengths of the lists can grow linearly with the number of parallel processes. In some networks, the time spent traversing an arbitrarily long queue may impact the entire system, because the network interface may be unable to service any other requests during the search. This can lead to a situation where a poorly written or erroneous application can affect the performance of other applications in the system. The time required to find a matching entry can be reduced by using hash tables, but the use of hash tables increases the time required to insert an entry into the list.
Prior art approaches have used network interface hardware specifically for MPI. For example, a general purpose processor (in some cases implemented by several individual processors) embedded within a network interface chip (NIC) can run a user-context thread to process incoming messages. In this approach, much of the protocol processing needed to support MPI can occur on the network interface. However, the embedded processor manages the queues as linear lists. In the Red Storm supercomputer, the network interface chips implement the Portals programming interface, which provides protocol building blocks that support general network functionality and MPI. However, incoming messages traverse a linear list in Portals. Other approaches attempt to use the network interface to implement MPI collective operations efficiently. These approaches focus on protocol optimizations and efficient data movement operations.
It is desirable in view of the foregoing to provide message passing implementations with list traversal capabilities that avoid the delays associated with linear list traversal. Exemplary embodiments of the present invention utilize associative matching structures which permit list entries to be searched in parallel fashion, thereby avoiding the delay of linear list traversal. List management capabilities are provided to support list entry turnover and priority ordering semantics.
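To make the idea of an associative match more concrete, the sketch below models in software what such a structure computes; it is not the design of the exemplary embodiments. Each entry stores a value and a "care" mask, an incoming key is compared against every entry to form a match vector, and the oldest matching entry is selected. The 16-bit field width, the mask encoding of wildcards, and the names `pack` and `AssociativeList` are all assumptions made for this illustration. Python evaluates the comparisons one after another, whereas an associative structure would evaluate them concurrently, so the sketch shows the result of the search rather than its timing.

```python
from typing import List, Optional, Tuple

FIELD_BITS = 16                  # assumed width of each matching field
FULL = (1 << FIELD_BITS) - 1     # mask with every bit of one field set


def pack(context: int, source: int, tag: int) -> int:
    """Concatenate the three matching fields into one integer key."""
    return (context << (2 * FIELD_BITS)) | (source << FIELD_BITS) | tag


class AssociativeList:
    """Software model of a masked associative match (hypothetical)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, int]] = []  # (value, care_mask), oldest first

    def append(self, value: int, care_mask: int) -> None:
        # Bits cleared in care_mask are "don't care" and act as wildcards.
        self._entries.append((value & care_mask, care_mask))

    def match_vector(self, key: int) -> List[bool]:
        """One comparison result per entry; conceptually computed all at once."""
        return [(key & mask) == value for value, mask in self._entries]

    def match_and_remove(self, key: int) -> Optional[int]:
        """Remove the oldest matching entry and return its index, else None."""
        hits = self.match_vector(key)
        if True not in hits:
            return None
        index = hits.index(True)   # lowest index = oldest entry = highest priority
        del self._entries[index]   # later entries shift down, preserving age order
        return index


posted = AssociativeList()
posted.append(pack(1, 3, 5), pack(FULL, FULL, FULL))  # exact context, source, tag
posted.append(pack(1, 0, 0), pack(FULL, 0, 0))        # source and tag wildcarded
both_hit = posted.match_vector(pack(1, 3, 5))         # [True, True]
oldest = posted.match_and_remove(pack(1, 3, 5))       # 0: the older entry wins
```

Deleting a matched entry and letting the remaining entries shift down is only one simple way to keep age order in a software model; the text does not specify how list entry turnover is handled.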