In a typical parallel computer environment, multiple, independent computing nodes communicate with one another over an interconnection network. In particular, messages are sent from one node to another node within the computer environment via the interconnection network.
During message transmission, it is possible for a message to be lost due to, for instance, an error within the network, a busy condition, a disconnected cable, or a mechanical failure on the line. However, for proper operation within the computer environment, it is imperative that a destination node receive any messages intended for that node. Thus, to ensure that the sender of a message knows whether or not the destination node has received the message, the destination node sends an acknowledgement back to the sender when it has received the message. If the sender does not receive an acknowledgement within the expected time, the sender treats the message as not having been received.
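The acknowledgement bookkeeping described above can be sketched as follows. This is a minimal illustration, not a description of any particular system: the class name `AckTracker`, the `timeout` parameter, the integer sequence numbers, and the abstract time units are all assumptions introduced for the example.

```python
class AckTracker:
    """Tracks sent messages that have not yet been acknowledged (hypothetical helper)."""

    def __init__(self, timeout):
        self.timeout = timeout  # assumed acknowledgement wait, in abstract time units
        self.pending = {}       # sequence number -> time the message was sent

    def record_send(self, seq, now):
        # Remember when each message was put on the network.
        self.pending[seq] = now

    def record_ack(self, seq):
        # An acknowledgement clears the pending entry; unknown acknowledgements are ignored.
        self.pending.pop(seq, None)

    def timed_out(self, now):
        """Return the sequence numbers whose acknowledgement is overdue, in ascending order."""
        return sorted(
            seq for seq, sent_at in self.pending.items()
            if now - sent_at >= self.timeout
        )


tracker = AckTracker(timeout=5)
tracker.record_send(1, now=0)
tracker.record_send(2, now=1)
tracker.record_ack(1)            # message 1 was acknowledged in time
overdue = tracker.timed_out(now=6)  # message 2 is treated as not received
```

In this sketch the sender never learns directly that a message was lost; it only observes that no acknowledgement arrived within the assumed waiting period.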
When an acknowledgement has not been received, the message is retransmitted to the destination node. In certain environments, however, it is necessary to maintain the sequential order of the messages; that is, the acknowledgements must arrive in the same sequence as the transmissions. Thus, when an unacknowledged message is retransmitted, any other messages sent after it are also retransmitted, so that the messages can be acknowledged in the proper order.
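The ordered retransmission described above resembles what is commonly called a go-back-N scheme, and can be sketched as a small simulation. The function name `transmit_in_order`, the `lost_transmissions` fault model, and the assumption that the receiver accepts only in-order messages are all hypothetical choices made for illustration.

```python
def transmit_in_order(messages, lost_transmissions):
    """Simulate ordered delivery over a link that loses some transmissions.

    `lost_transmissions` is a set of 0-based transmission counts to drop
    (an assumed fault model). Returns a pair: the sequence numbers put on
    the wire, in order, and the sequence numbers in the order acknowledged.
    """
    wire_log = []
    acked = []
    base = 0    # oldest unacknowledged sequence number
    count = 0   # total transmissions so far
    while base < len(messages):
        first_lost = None
        # Send (or resend) every message from the oldest unacknowledged one onward.
        for seq in range(base, len(messages)):
            wire_log.append(seq)
            if count in lost_transmissions and first_lost is None:
                first_lost = seq
            count += 1
        # Assumed receiver behaviour: only in-order messages are accepted,
        # so acknowledgements stop at the first lost message.
        stop = len(messages) if first_lost is None else first_lost
        acked.extend(range(base, stop))
        base = stop
    return wire_log, acked


# The second transmission (message 1) is lost, so messages 1, 2 and 3 are all resent.
wire_log, acked = transmit_in_order(["a", "b", "c", "d"], lost_transmissions={1})
```

The cost of preserving order this way is visible in `wire_log`: messages 2 and 3 cross the network twice even though only message 1 was lost.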
A need exists for an improved capability to track transmission of messages and acknowledgements to those messages. A further need exists for an improved capability that achieves correct sequentiality and preserves ordering of message transmission, even when messages must be sent multiple times to recover from transmission errors.