This Application relates to message processing systems. More particularly, this Application relates to a system for facilitating the transmission of messages from a source node to a destination node in a message processing system.
Message processing systems, for example, the multiprocessor data processing system 10 depicted in FIG. 1, require reliable message communication paths between respective ones of the processors 121 . . . 12j. The exemplary system 10 of FIG. 1 employs an exemplary communication medium or switch network 20 commonly coupled to the processors 12. The processors may require respective communication adapters 141 . . . 14j to control communications between each processor 12 and the medium 20 via respective connections 161 . . . 16j. Communication between, for example, software application(s) executing on the processors 12 of system 10 can thus be provided via medium 20. Storage medium 22 may be employed in the system to hold the applications, associated data, etc.
Because respective processors may be supporting different, but related application software partitions, messaging must be used as a form of communication between the processors. For example, messages may require transmission from a "source" node (e.g., processor 121) to a "destination" node (e.g., processor 12j).
The asynchronous nature of the application software partitions on the source and destination nodes often results in a condition where the number of messages sent from a source node exceeds the destination node's ability to handle them. Normally, the destination node is expected to post buffers to hold incoming messages. The messages can then be retrieved from the buffers and appropriately processed by the application software. This is illustrated in FIG. 2, which is a hybrid hardware/software diagram of a message processing system like that of FIG. 1 and which depicts a message source node 181 and a message destination node 18j. (The term "node" is used broadly herein to connote any identifiable combination of hardware and/or software to or from which messages are passed.) Source node 181 has allocated therein send message buffers 30 within which are placed messages M(1), M(2) and M(3) which, for application reasons, are required to be sent through send message processing 32, across medium 20, to destination node 18j.
Destination node 18j, in anticipation of the arrival of messages from various sources in the system, can allocate or post receive buffers 40. In the example of FIG. 2, buffer B1 holds the first arriving message M(1), buffer B2 holds the second arriving message M(2) and buffer B3 holds the third arriving message M(3). Received message processing 42 then removes messages from their buffers and can then pass the messages to receive processing 44 (e.g., the application software partition executing at the destination node).
Those skilled in the art will understand that message ordering in a system can be imposed by using a particular protocol, e.g., messages sent from a particular source to a particular destination may be sequentially identified and the sequential indicia can be transmitted as control information along with the data portions of the messages.
The process of allocating or posting receive buffers 40 in destination node 18j is often a dynamic one, and if more messages arrive than there are buffers posted, buffer overrun can occur. Traditional solutions to avoid buffer overrun at the destination node include (1) data buffering with a pre-reservation protocol, or (2) adopting a convention wherein the destination node automatically discards packets, assuming that the source node will retransmit them after a time-out. The first solution assumes a destination node that is frequently unprepared to accommodate data, and the second solution assumes a destination that is rarely unprepared to accommodate data.
A problem with the first solution occurs when message size is practically unbounded, or if the number of message sources is large. Large messages can be decomposed into smaller pieces and flow controlled into the buffers, if the overhead to do so is manageable. However, many sources present problems with buffer fragmentation or starvation. Distributed fairness protocols can be introduced to solve these problems, but at a price in complexity and additional overhead.
A problem with the time-out/retransmit solution is that should the destination be unable to accommodate the data for an extended period of time, many needless retransmits will occur, occupying otherwise useful bandwidth on the medium.
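The bandwidth cost of the time-out/retransmit convention can be illustrated with a small calculation. The following is a minimal sketch only; the function name `needless_retransmits` and the assumptions of a fixed time-out period and a destination that becomes ready at a single known instant are hypothetical and are not taken from the description above.

```python
def needless_retransmits(unready_time: int, timeout: int) -> int:
    """Count retransmissions that arrive while the destination is still unprepared.

    Assumes (hypothetically) that the source retransmits at times
    timeout, 2*timeout, ... and that the destination becomes able to
    accommodate the data at time `unready_time`.
    """
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    wasted = 0
    t = timeout
    while t < unready_time:   # this retransmission is discarded again
        wasted += 1
        t += timeout
    return wasted
```

Under these assumptions, a destination that is unprepared for 10 time units with a time-out of 3 units discards three retransmissions before one finally succeeds, and the count grows in proportion to the time the destination remains unprepared.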
A third conventional solution to this problem is a rendezvous protocol. A rendezvous protocol involves the transmission from the source node of a control information packet relating to a message to be sent from the source node to the destination node. The control information may include an indication of the length of the entire data portion of the message to be sent, as well as indicia which identifies the message and/or its sequence. When a buffer of adequate length is allocated or posted at the destination node, an acknowledgment packet transmission is sent from the destination node to the source node, and the source node can thereafter reliably send the entire message to the destination node. This technique also makes conservative assumptions about the preparedness of the destination node to accommodate the data portion of the message. In conventional rendezvous protocols, the initial exchange of the control information and acknowledgment packets results in a loss of performance because two packets are now required to be exchanged between the source and destination nodes before any actual message data can be exchanged.
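The packet exchange of a conventional rendezvous protocol, as described above, might be sketched as follows. This is an illustrative simplification: the class and function names are hypothetical, and the sketch simply returns early when no adequate buffer is posted rather than modeling the later acknowledgment.

```python
from dataclasses import dataclass, field


@dataclass
class RendezvousDestination:
    """Destination that acknowledges only when an adequate buffer is posted."""
    posted_buffer_len: int = 0                      # length of the posted buffer
    received: list = field(default_factory=list)    # (msg_id, data) pairs

    def on_control(self, msg_id: int, length: int) -> bool:
        # Acknowledge only if the posted buffer can hold the whole data portion.
        return self.posted_buffer_len >= length

    def on_data(self, msg_id: int, data: bytes) -> None:
        self.received.append((msg_id, data))


def rendezvous_send(dest: RendezvousDestination, msg_id: int, data: bytes) -> int:
    """Return the number of packets exchanged in this attempt."""
    packets = 1                                     # control packet, source -> destination
    if not dest.on_control(msg_id, len(data)):
        return packets                              # no acknowledgment yet; source waits
    packets += 1                                    # acknowledgment, destination -> source
    dest.on_data(msg_id, data)
    packets += 1                                    # data packet, source -> destination
    return packets
```

Even when the destination is fully prepared, the sketch counts three packets per message, two of which carry no message data. That is the overhead the paragraph above identifies.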
What is required, therefore, is a method, system, and associated program code and data structures, which prevent the performance degradation associated with packet retransmission after time-outs, or with standard rendezvous protocols in which an exchange of packets between source and destination nodes occurs before any actual message data is exchanged.
The shortcomings of the prior approaches are overcome by the present invention, which relates to a system for facilitating the efficient transmission and flow control of messages from a source node to a destination node in a message processing system.
The present invention seeks to strike a balance between the ultra-conservatism of pure buffering and rendezvous, and the ultra-optimism of time-out/retransmit. The present invention assumes that the destination is generally able to accommodate data portions of messages, but if it is not, the time that it may take to become prepared may be very long. Such conditions often arise in multi-tasking systems where context swaps between processes are long and a process may be suspended for an unbounded period of time. To accommodate this type of environment, the present invention involves optimistically sending the data portion of a message along with control information, in an initial transmission from the source to the destination. However, it is not appropriate for the destination to discard the entire content of this transmission if it is unable to accommodate the data, since there may be many time-out periods before the destination is subsequently able to accommodate the data. Therefore, the destination retains enough control information to identify the message to the source, but discards the data portion of the message (i.e., "runts" the message). The source does not time-out/retransmit; rather, it waits for the destination to notify it that it is prepared. At that time, the source retransmits the message, knowing the destination will be able to accommodate it. The number of retransmissions from the source node to the destination node is therefore bounded to one.
In that regard, in one aspect, the present invention relates to a flow control method for transmitting a plurality of messages from a source node to a destination node in a message processing system. The plurality of messages includes a first message comprising a data portion. The source node transmits the data portion of the first message, and control information of the first message, to the destination node. In response to the destination node being unable to accommodate the data portion of the first message, the destination node discards the data portion of the first message.
In further response to the destination node being unable to accommodate the data portion of the first message, the destination node retains at least some of the control information of the first message.
In response to the destination node being subsequently able to accommodate the data, the destination node uses at least some of the retained control information to transmit a first "pull" request to the source node to retransmit the data portion of the first message. In response to this pull request, the source node retransmits the data portion of the first message to the destination node.
The discarding process is repeated for messages subsequent to the first message, until the destination node becomes able to accommodate the data portion of the first message, as well as the data portions of the subsequent messages. In that regard, the present invention relates to, in another aspect, the destination node discarding the data portion of the first message (in response to being unable to accommodate the data portion) but retaining sequence indicia of the control information thereof, and sending a negative acknowledgment relating to the first message to the source node (i.e., "runts" the first message). Until it is able to accommodate any data portions of messages, the destination node discards respective data portions of subsequent messages that are received thereby, but retains the respective sequence indicia of the respective control information thereof, and sends respective negative acknowledgments relating thereto to the source node.
As the destination node becomes able to accommodate the respective data portions of the first message and any of the subsequent messages, the destination node initiates, via respective pull requests to the source node, respective retransmissions of the respective data portions of the first message and said any of the subsequent messages. In response to respective pull requests from the destination node, the source node retransmits the respective data portions of the first message and said any of the subsequent messages to the destination node.
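The runt, negative acknowledgment, and pull sequence described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class names, the `free_buffers` count, and the direct method calls standing in for transmissions across the medium are all hypothetical, and positive acknowledgments are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    sent: dict = field(default_factory=dict)           # seq -> data kept for retransmission
    transmissions: dict = field(default_factory=dict)  # seq -> times the data was sent
    nacked: list = field(default_factory=list)         # seqs negatively acknowledged

    def send(self, seq, data, dest):
        # Optimistic initial transmission: data and control information together.
        self.sent[seq] = data
        self.transmissions[seq] = 1
        dest.on_message(seq, data, self)

    def on_nack(self, seq):
        # No time-out/retransmit: the source only records the runt and waits.
        self.nacked.append(seq)

    def on_pull(self, seq, dest):
        # Retransmit only when the destination asks for the data.
        self.transmissions[seq] += 1
        dest.on_message(seq, self.sent[seq], self)


@dataclass
class Destination:
    free_buffers: int = 0
    delivered: list = field(default_factory=list)      # (seq, data) accommodated
    runted: list = field(default_factory=list)         # retained sequence indicia

    def on_message(self, seq, data, source):
        if self.free_buffers > 0:
            self.free_buffers -= 1
            self.delivered.append((seq, data))
        else:
            self.runted.append(seq)                    # keep control info, discard data
            source.on_nack(seq)

    def post_buffers(self, count, source):
        # As buffers are posted, pull runted messages in sequence order.
        self.free_buffers += count
        while self.runted and self.free_buffers > 0:
            source.on_pull(self.runted.pop(0), self)
```

For example, if a `Source` sends messages 1, 2 and 3 to a `Destination` with no posted buffers, all three are runted. A later call to `post_buffers(3, source)` pulls each one exactly once, so every message is transmitted at most twice in this sketch.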
To control this operation at both the source and destination nodes, the source node maintains a message sent number, as well as an expected acknowledgment number, which is incremented as respective acknowledgments of successfully accommodated data portions of messages are received from the destination node. The destination node maintains a respective message number which is incremented as respective initial transmissions or retransmissions of data portions are successfully accommodated, as well as an expected "runt" number which is incremented as respective data portions of messages are discarded and negative acknowledgments transmitted to the source node therefor.
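The counters named above might be kept as in the following sketch. The field names, the zero initial values, and the treatment of each counter as a simple count of events are assumptions made for illustration; the text above does not specify how the numbers are compared against incoming sequence indicia.

```python
from dataclasses import dataclass


@dataclass
class SourceCounters:
    message_sent: int = 0    # assumed: incremented on each initial transmission
    expected_ack: int = 0    # incremented per acknowledgment received

    def on_send(self) -> None:
        self.message_sent += 1

    def on_ack(self) -> None:
        self.expected_ack += 1

    def outstanding(self) -> int:
        # Messages sent whose data portions are not yet acknowledged.
        return self.message_sent - self.expected_ack


@dataclass
class DestinationCounters:
    message_number: int = 0  # incremented when a data portion is accommodated
    expected_runt: int = 0   # incremented when a data portion is discarded

    def on_arrival(self, buffer_available: bool) -> str:
        """Update counters for one arriving transmission; return the reply type."""
        if buffer_available:
            self.message_number += 1
            return "ack"
        self.expected_runt += 1
        return "nack"
```

In this sketch, a retransmission that follows a pull request is handled by the same `on_arrival` path as an initial transmission, which matches the statement that the message number advances for either kind of successfully accommodated data portion.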
The herein disclosed flow control variant of the rendezvous protocol strikes a balance between the ultra-conservatism of pure buffering and rendezvous, and the ultra-optimism of time-out/retransmit, since it assumes that the destination is generally prepared, but if it is not, the time it may take to become prepared can be very long. This optimistic assumption leads to medium bandwidth savings, and further, the number of retransmits from the source node to the destination node is bounded to one.