Processes located at nodes of a communications network frequently need to be able to send messages to and receive messages from other processes at different nodes. It is important for the efficiency of some of these data communications to be able to transfer messages in a batch of several messages. Such batching of messages improves message throughput and can reduce network communication traffic by limiting control communications (such as sender and receiver location information, confirmations of receipt and commit processing) to one set of communication flows per batch instead of one set per message. In transaction processing systems, committing updates on completion of a transaction involves a relatively high processing overhead, so committing only at the end of a batch of transactional updates can significantly improve system efficiency.
In the context of the present invention the phrases "message transfer" and "messaging" are to be interpreted as including, where the context permits, packet switching and the transfer between network nodes of any data transfer unit, whether an application data message, error message, reconfiguration message, or network status message. In a distributed data processing network, the network nodes are data processing resources, e.g. (i) computer systems having computer programs installed thereon, these systems being connected for communication over underlying network links, or, (ii) in networks for which communications management programs are the network entities to which messages are sent, the communications management programs themselves. In a packet switching network, the "nodes" are the switches.
Communication between remote processes within a distributed network may involve multiple intermediate network links and nodes forming a communication path. Messages travelling from one node of the network to an adjacent node may have different ultimate destinations and yet it may be most efficient to transfer several of these messages together as a batch. The sending node can request confirmation that the batch has successfully arrived and been stored at the adjacent receiving node as a final stage of transmission of the batch, such that it is unnecessary to check successful transfer of each message separately.
There are various reasons why it may not be possible to successfully store one or more of the messages of a batch at the receiving node. For example, the destination address may not be recognized (perhaps the named destination does not exist), or there may be a problem with the message storage facilities of the receiving node: the storage facility may already be full, or the system may have been set so as not to allow messages to be added to storage.
It is a general requirement of data processing systems to ensure that critical data communications (e.g. messages which affect critical data, such as a funds transfer in a banking system) are successfully completed once and once only. Thus, facilities are required for recovering from failures, whether these be failures of the data processing systems, failures of their connecting communication links, or application-detected errors. In a messaging system that uses batching, the implementation of this requirement may involve ensuring that if not all of the batch of messages transmitted from a first node can be successfully stored at the next adjacent node, then the whole batch is backed out. That is, either the whole batch is successfully transferred and stored at a receiver node, or no messages within the batch are stored at the receiver node and all of the messages are retained in storage at the sender node. Transfer of a backed-out batch of messages can subsequently be retried.
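The all-or-nothing behaviour described above can be illustrated with a minimal Python sketch. It is not taken from any particular messaging product: the names `ReceiverNode`, `StoreError` and `transfer_batch`, the in-memory lists standing in for message storage, and the snapshot-based backout are all assumptions made purely for illustration.

```python
class StoreError(Exception):
    """Raised when a message cannot be stored at the receiver (hypothetical)."""


class ReceiverNode:
    """Hypothetical receiver holding named queues of limited capacity."""

    def __init__(self, queue_names, capacity):
        self.queues = {name: [] for name in queue_names}
        self.capacity = capacity

    def store(self, destination, body):
        if destination not in self.queues:
            raise StoreError(f"unknown destination: {destination}")
        if len(self.queues[destination]) >= self.capacity:
            raise StoreError(f"queue full: {destination}")
        self.queues[destination].append(body)


def transfer_batch(sender_queue, receiver, batch_size):
    """Transfer up to batch_size messages; commit all of them or back out all."""
    batch = sender_queue[:batch_size]
    # Snapshot the receiver's queues so the whole batch can be backed out.
    snapshot = {name: list(msgs) for name, msgs in receiver.queues.items()}
    try:
        for destination, body in batch:
            receiver.store(destination, body)
    except StoreError:
        receiver.queues = snapshot      # back out: nothing from this batch is kept
        return False                    # sender retains every message for a retry
    del sender_queue[:len(batch)]       # commit: sender releases the whole batch
    return True


if __name__ == "__main__":
    sender = [("A", "m1"), ("X", "m2"), ("A", "m3")]   # "X" does not exist
    receiver = ReceiverNode(["A"], capacity=10)
    print(transfer_batch(sender, receiver, 3))  # False
    print(receiver.queues, len(sender))         # {'A': []} 3
```

In this sketch a single unstorable message (the one addressed to the non-existent destination "X") causes the two storable messages to be backed out as well, and every retry of the same batch fails in the same way.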
A problem with this all-or-nothing approach to batch transfer of messages is that repeated backouts of a batch will impact message transfer performance. A large batch of messages may repeatedly fail to be transferred because of a recurring problem which is associated with only one message in the batch.
One solution to this problem which avoids repeatedly backing out the entire batch is to store problem messages on a "dead-letter queue" at the receiver node. Rather than simply discarding problem messages (which is unacceptable in systems where messages must not be lost), a dead-letter queue is provided as a storage facility of the receiver system into which undeliverable messages are placed. Use of a dead-letter queue makes the problem visible and enables fault correction processes to be used, or the messages to be redirected, or messages communicating the occurrence of the message delivery problems to be sent back to the origin of the message.
This solution has its own limitations. First, some means of processing the dead-letter queue is required. Second, it may not be possible to store all problem messages on a dead-letter queue: there may be no dead-letter queue defined or, if one is defined, it may be full or there may be access problems. In such cases, it is again necessary to reject the entire batch to preserve once-and-once-only delivery semantics. That is, all messages within the batch that have been transferred to storage on the receiving node are backed out. The consequence of this is that all messages which need to be transferred to this next node (whether or not it is their final destination) are delayed, including the messages which could have been successfully stored on this next node. Such message transfer delays can be a significant problem if the reason for the delay persists.
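The dead-letter queue approach, including the fallback to rejecting the whole batch, might be sketched as follows. This is an illustrative Python sketch only: the function `receive_batch`, its parameters, and the use of plain lists and a capacity limit to model the dead-letter queue are hypothetical and do not describe any specific system.

```python
def receive_batch(batch, queues, dead_letter_queue=None, dlq_capacity=0):
    """Store a batch at the receiver, diverting problem messages.

    batch: list of (destination, body) pairs.
    queues: dict mapping known destination names to lists of stored bodies.
    dead_letter_queue: list for undeliverable messages, or None if undefined.
    Returns True if the batch was committed, False if it was backed out.
    """
    # Stage all updates first so that a backout leaves the receiver untouched.
    staged = {name: [] for name in queues}
    staged_dead = []
    for destination, body in batch:
        if destination in queues:
            staged[destination].append(body)
            continue
        # Problem message: it can only be kept if a usable dead-letter queue exists.
        if dead_letter_queue is None:
            return False                       # no dead-letter queue defined
        if len(dead_letter_queue) + len(staged_dead) >= dlq_capacity:
            return False                       # dead-letter queue is full
        staged_dead.append((destination, body))
    # Commit: apply every staged update together.
    for name, bodies in staged.items():
        queues[name].extend(bodies)
    if staged_dead:
        dead_letter_queue.extend(staged_dead)
    return True


if __name__ == "__main__":
    batch = [("A", "m1"), ("X", "m2"), ("A", "m3")]   # "X" does not exist
    queues, dlq = {"A": []}, []
    print(receive_batch(batch, queues, dlq, dlq_capacity=5))  # True
    print(queues, dlq)  # {'A': ['m1', 'm3']} [('X', 'm2')]
    print(receive_batch(batch, {"A": []}))  # False: no dead-letter queue
```

When the dead-letter queue is missing or full, the sketch returns without applying any staged update, so the storable messages "m1" and "m3" are delayed along with the problem message.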
Furthermore, it may be a requirement of the system that messages with the same ultimate destination are never sent out of order. In such cases, use of a dead-letter queue is unacceptable unless some means is provided to correct the ordering before the messages are finally delivered.
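The ordering problem can be made concrete with a short Python sketch. It assumes, purely for illustration, that each message carries a sequence number and that dead-lettered messages are redelivered only after the rest of the batch has been committed; the function names are hypothetical.

```python
def arrival_order(batch, storable):
    """Return sequence numbers in the order they reach the destination queue."""
    delivered, dead_letter_queue = [], []
    for seq in batch:
        if storable(seq):
            delivered.append(seq)
        else:
            dead_letter_queue.append(seq)   # diverted instead of backing out
    # Assumption: diverted messages are redelivered after the batch is committed.
    delivered.extend(dead_letter_queue)
    return delivered


def in_order(seqs):
    """True if the sequence numbers are strictly increasing."""
    return all(a < b for a, b in zip(seqs, seqs[1:]))


if __name__ == "__main__":
    order = arrival_order([1, 2, 3], storable=lambda seq: seq != 2)
    print(order, in_order(order))  # [1, 3, 2] False
```

Message 2 arrives after message 3, so the destination sees the messages out of order unless the ordering is corrected before final delivery.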