The present invention relates to a data transfer network for transferring data in the form of messages.
Many software applications written for parallel computer systems, particularly those with distributed memory, use a message passing software library as an interface between the application software and the communication network hardware. MPI: A Message-Passing Interface Standard is an example of such a library.
In the case of MPI, the library is implemented as software routines that may be called by application software and are executed by the instruction processor in each node of a parallel computer. The instruction processor controls the node's transmit and receive circuits that are connected to the other nodes of the system via a network.
There are two fundamental software routines in libraries that follow the MPI standard, one being a send request routine that is used to request the sending of a message to another node in the system, and the other being a receive request routine that is used to request the reception of a message.
When an application calls the send request routine to send a message, it specifies several parameters including: the destination node number of the message, each node having a unique identifying number; the message tag value that may be used by the application to distinguish different types of messages; a communicator value that is used to specify the communication context of the send transmission, messages being received only within the same communication context; the address of the data to be sent in local memory; and the quantity of said data.
When an application calls the receive request routine, it specifies the message that it would like to receive and where it should be written in local memory. These parameters are passed to the receive request routine and include: a tag specification which may specify either a specific tag or alternatively a wildcard tag value; a source specification which may specify either a specific source or alternatively a wildcard source value; a communicator value that is used to specify the communication context of the receive transmission, the receive request matching only messages sent within the same communication context; the address of the destination buffer in local memory; and the size of the destination buffer.
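The matching rule implied by these send and receive parameters can be sketched as follows. This is only an illustrative model, not code from any MPI library: the names `Envelope`, `ReceiveSpec`, `ANY_SOURCE`, `ANY_TAG` and `matches` are hypothetical, and the wildcard is assumed to be encoded as -1.

```python
from dataclasses import dataclass

# Hypothetical wildcard markers; the encoding (-1) is an assumption.
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass(frozen=True)
class Envelope:
    """Identifying parameters carried by a sent message."""
    source: int        # node number of the sender
    tag: int           # application-defined message type
    communicator: int  # communication context


@dataclass(frozen=True)
class ReceiveSpec:
    """Parameters of a receive request; source and tag may be wildcards."""
    source: int
    tag: int
    communicator: int


def matches(spec: ReceiveSpec, env: Envelope) -> bool:
    """Return True if the receive specification accepts the message."""
    # A receive request only matches messages sent in the same context.
    if spec.communicator != env.communicator:
        return False
    # A specific source or tag must be equal; a wildcard accepts any value.
    if spec.source != ANY_SOURCE and spec.source != env.source:
        return False
    if spec.tag != ANY_TAG and spec.tag != env.tag:
        return False
    return True
```

Under these assumptions, a request with a wildcard source and tag 7 in context 0 accepts a message from node 3 with tag 7 in context 0, but a fully wildcarded request in context 1 does not.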
The receive routine either returns immediately to the application a message that matches the receive message specification, if such a message has already been sent to the node, or, if the request cannot be completed immediately, records the information relating to the request. This record is called a "posted receive". When a request to send such a message is later issued, the receive routine may then complete and return the message to the application software. The MPI standard also specifies that messages should be non-overtaking, meaning that if more than one pending receive request matches a message, then the oldest pending receive request should be satisfied by the message.
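The behaviour described in this paragraph can be modelled in software roughly as below. This is a minimal sketch under stated assumptions, not the implementation of any particular library: the class `Node`, its methods `receive` and `deliver`, the dictionary layout of requests and messages, and the use of `None` as the wildcard are all hypothetical.

```python
ANY = None  # hypothetical wildcard value for source or tag


def matches(request, message):
    """True if a receive request accepts a message (same context, wildcards allowed)."""
    return (request["comm"] == message["comm"]
            and request["source"] in (ANY, message["source"])
            and request["tag"] in (ANY, message["tag"]))


class Node:
    def __init__(self):
        self.posted = []   # posted receives, oldest first
        self.arrived = []  # messages that arrived before a matching receive

    def receive(self, source, tag, comm):
        """Complete at once if a matching message already arrived, else post."""
        request = {"source": source, "tag": tag, "comm": comm, "data": None}
        for i, message in enumerate(self.arrived):
            if matches(request, message):
                request["data"] = self.arrived.pop(i)["data"]
                return request
        self.posted.append(request)  # recorded as a "posted receive"
        return request

    def deliver(self, message):
        """Handle an incoming message; the oldest matching posted receive wins."""
        for i, request in enumerate(self.posted):
            if matches(request, message):
                request["data"] = message["data"]
                del self.posted[i]
                return request
        self.arrived.append(message)
        return None
```

Because `deliver` scans the posted receives from oldest to newest, a message that matches several pending requests always satisfies the oldest one, which is the ordering rule stated above.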
An example of hardware that is used to transfer a message from local memory in one node to that of another node is described in the Japanese patent application that has been made public, number 6-324998. An embodiment of such hardware is represented by FIG. 3. A plurality of nodes are connected to a message transfer network 902; for simplicity, only one node 901 is shown. Each node contains an instruction processor or IP 903, local memory 904 and a network interface adapter or NIA 905 that are all linked together by a bus 906. The NIA 905 contains a transmit circuit 908 and a receive circuit 910 that are connected to the network 902 via a transmit network connection 909 and a receive network connection 911 respectively. A network message 912, including message data 917, has a message header that contains a DNN field 913 that specifies which node should receive the message, as well as a Tag field 916. Within the receive circuit, there is an input port 918, a receive sequencer 919, a direct memory access controller [DMAC] 921, and a receive posting register 920. Upon arrival of a message at the input port 918, the Tag field 916 of the message is checked against a tag field held within the posting register 920. If there is a match, the DMAC 921 is commanded by the receive sequencer 919 to use the user receive buffer 933 to receive the message data 917. The address pointer 932 to the user receive buffer 933 also is located within the posting register 920 and the user receive buffer 933 typically is located within the user's application space 934 in local memory 904. If there is no match, the message is received in the message buffer 941 in local memory.
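The receive path of this prior-art hardware can be approximated by the following software model. It is a sketch for illustration only: the class and method names are hypothetical, the user receive buffer is modelled as a `bytearray` assumed large enough for the data, and clearing the posting register after a match is an assumption rather than something the cited application is known to specify.

```python
class ReceiveCircuit:
    """Toy model of a receive circuit with a single receive posting register."""

    def __init__(self):
        self.posting_register = None  # (tag, user_buffer) or None when empty
        self.message_buffer = []      # stands in for the message buffer in local memory

    def post(self, tag, user_buffer):
        """Load the single posting register with a tag and a buffer pointer."""
        self.posting_register = (tag, user_buffer)

    def on_arrival(self, tag, data):
        """Compare the message Tag field with the tag in the posting register."""
        if self.posting_register is not None and self.posting_register[0] == tag:
            _, user_buffer = self.posting_register
            # Models the DMAC writing the message data into the user buffer.
            user_buffer[:len(data)] = data
            # Assumption: the register is consumed by the matching message.
            self.posting_register = None
            return "user_buffer"
        # No match: the message goes to the message buffer instead.
        self.message_buffer.append((tag, bytes(data)))
        return "message_buffer"
```

In this model only one tag can be awaited at a time, and any message whose tag differs from the posted one ends up in the message buffer.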
This hardware provides a receiver that supports only a single posted receive request for a message. It also includes a message buffer for recording messages that do not match the receive posting register. This message buffer can be considered as a message pool, and the described hardware supports only a single message pool.
Using the hardware described above, it is possible to request the reception of only a single message at a time using the receive posting register. However, most message passing libraries, including MPI, support a large number of simultaneously active receive requests and also support wildcards.
If an MPI message passing library were to be implemented that tried to take advantage of the single receive posting register described above, the library would only be able to use it to post one pending receive request and only if this request did not use wildcards. Also, if it were desired to post a request other than the oldest pending request, in order not to violate the message ordering rules of MPI, it would only be possible if none of the older requests could also match the same message. There would also be software overhead involved with checking for such receive conflicts. After reception of the message posted using the receive posting register, the receive circuit would have to interrupt the instruction processor in order for the instruction processor to post another request using the receive posting register. This would lead to frequent instruction processor interrupts and increased software overhead, tending to reduce the effective processing power of the instruction processor. Meanwhile, messages that are received by the receive circuit in an order different from the order in which the corresponding receive requests were issued by software would be written to the pool. Such messages received by the pool would still have to be copied to the destination buffer by the instruction processor, thereby further reducing performance.
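The conflict check mentioned above, which decides whether a pending request may be placed in the single posting register, might look like the following. This is a hedged sketch: the function names, the dictionary layout of a request, and the use of `None` as the wildcard are assumptions made for illustration.

```python
ANY = None  # hypothetical wildcard value for source or tag


def could_match_same_message(older, candidate):
    """True if an older request would also accept the candidate's message.

    The candidate is assumed to have no wildcards, so the message it accepts
    has exactly the candidate's source, tag and communication context.
    """
    return (older["comm"] == candidate["comm"]
            and older["source"] in (ANY, candidate["source"])
            and older["tag"] in (ANY, candidate["tag"]))


def can_use_posting_register(pending, index):
    """Decide whether pending[index] may be posted to the hardware register.

    `pending` lists the outstanding receive requests, oldest first.
    """
    candidate = pending[index]
    # Requests using wildcards cannot be expressed in the register.
    if candidate["source"] is ANY or candidate["tag"] is ANY:
        return False
    # No older request may be able to claim the same message first.
    return not any(could_match_same_message(older, candidate)
                   for older in pending[:index])
```

The scan over all older requests runs on the instruction processor each time a request is considered for posting, which illustrates the software overhead the paragraph refers to.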
An alternative method of implementing an MPI library using the above-described hardware would involve the exchange of control messages between nodes prior to actual transfer of a message. These control messages could be used to configure the receive posting register before the actual message is transferred and could ensure that block copy operations of message data to the destination buffer are not required. However, such a solution has serious performance problems. First, message transfer latency would be large due to the overhead of transferring the control messages. Second, due to the asynchronous nature of the control messages, frequent instruction processor interrupts would be required to service them, thus reducing the effective processing power of the instruction processor.
Further, if there are software tasks other than the communicating task running on a node and the communicating task would like to sleep while waiting for a message, it is not possible to set the receive circuit to cause an instruction processor interrupt for more than one particular message. It would be necessary to enable interrupts for reception of messages to the message pool. The interrupt handler software routine would have to check the message pool for reception of the particular message each time a message was received by the receive circuit.
Furthermore, if there are multiple communication contexts in use by a single node, it is not possible to assign separate message pools for messages belonging to these separate communication contexts. If many messages are sent within a single communication context, this can cause pool space for all communication processes to run out.
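The exhaustion problem can be illustrated with a toy model of a single message pool shared by all communication contexts. The class name, the fixed capacity counted in messages, and the rejection of messages when the pool is full are assumptions made purely for illustration.

```python
class SharedMessagePool:
    """Toy model of one message pool shared by every communication context."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit, counted in messages
        self.messages = []        # (communicator, data) pairs

    def store(self, communicator, data):
        """Store an unmatched message; return False if the pool is full."""
        if len(self.messages) >= self.capacity:
            return False
        self.messages.append((communicator, data))
        return True


pool = SharedMessagePool(capacity=4)
# A burst of messages in context 0 fills the whole pool ...
burst_results = [pool.store(0, b"bulk") for _ in range(4)]
# ... so a message belonging to context 1 no longer fits.
other_context_stored = pool.store(1, b"urgent")
```

With one pool there is no way to reserve space for context 1, so traffic in context 0 alone determines whether its messages can be accepted.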