In a distributed system that includes nodes (henceforth called network nodes) connected via a communications network, applications running on the network nodes communicate with each other by sending and receiving messages over the communications network. A portion of memory in an application's address space on the network node on which the application runs is used as a receive buffer region for the application's incoming packet data. Received messages wait in the buffer until they are processed by the application process for which they are destined. The receive buffer region may be divided on a per-application-process basis. Because an application process may, at a given time, communicate with many other application processes running on other network nodes and receive messages from them, the receive buffer associated with the application process is shared by all incoming messages destined for that application process.
Efficiently allocating the receive buffers on a network node presents many challenges. These challenges include, for example, determining the size of the receive space that should be reserved, what remedial actions should be taken upon a no-free-buffer condition, and the potential effect of one congested process on other non-congested processes. Receive buffers may be allocated dynamically.
Dynamic allocation grows the receive buffer on an as-needed basis. The complexity of the allocation may increase along with the application's computations. Thus, with this approach, applications may need to perform complex buffer management of the receive space.
Fixed allocation of a reasonably small, constant-size buffer provides a viable solution. The amount to allocate may be based on the expected performance of the application and on the size and number of expected incoming messages (sometimes called the working set). However, deciding on the appropriate size of the receive buffer presents a difficult choice. For example, allocating too little receive space may result in poor performance and more congestion; allocating too much space may result in low utilization; and allocating on a worst-case basis may turn out to be a non-scalable solution and may also result in poor utilization. Moreover, fixed allocation should reflect the priority of the processes, their typical working set sizes, and the typical or average sizes of the messages that need to be placed into the buffer.
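By way of illustration only, the sizing trade-off described above may be sketched as simple arithmetic. The function names and all numeric values below are hypothetical and are chosen solely to show how a worst-case allocation may lead to low utilization under a typical working set.

```python
def fixed_buffer_size(expected_messages, message_size):
    """Bytes to reserve for a given number of messages of a given size."""
    return expected_messages * message_size


def utilization(bytes_in_use, buffer_size):
    """Fraction of the reserved receive space that is actually occupied."""
    return bytes_in_use / buffer_size


# Hypothetical workload: a typical working set versus a worst-case estimate.
typical = fixed_buffer_size(32, 2048)          # 32 messages of 2 KiB each
worst_case = fixed_buffer_size(1024, 65536)    # 1024 messages of 64 KiB each

# Utilization of a worst-case buffer while only the typical load is present.
typical_load_utilization = utilization(typical, worst_case)
```

With these assumed values, the worst-case buffer is 1024 times larger than the typical working set, so less than 0.1 percent of it is occupied under typical load.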
A credit-based solution allocates the receive space as a credit limit to each source sending process. The entire receive space is divided among the potential source processes that may send messages to the receiving process. The sending process at the source host keeps track of how much in-flight (i.e., sent but not yet acknowledged) data it has pending between itself and each destination host, and does not allow more data than the credit permits to be in flight for any source-destination pair. The source sending process may not fully utilize its assigned credit limit. The drawback of this approach is that, if a receiving process receives messages from only a small number of source processes at any given time, much of the receive buffer space is poorly utilized most of the time.
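A minimal sketch of such credit accounting on the sending side is given below for illustration. The class name `CreditSender`, the helper `per_source_credit`, and the even division of the receive space are assumptions made for this example and are not prescribed by any particular system.

```python
def per_source_credit(receive_space, potential_sources):
    """Assumed policy: divide the receive space evenly among potential senders."""
    return receive_space // potential_sources


class CreditSender:
    """Tracks in-flight (unacknowledged) bytes per destination against a credit."""

    def __init__(self, credit_limit):
        self.credit_limit = credit_limit   # bytes allowed in flight per destination
        self.in_flight = {}                # destination -> unacknowledged bytes

    def try_send(self, dest, size):
        """Send only if the credit for this destination would not be exceeded."""
        pending = self.in_flight.get(dest, 0)
        if pending + size > self.credit_limit:
            return False                   # caller must wait for an acknowledgment
        self.in_flight[dest] = pending + size
        return True

    def on_ack(self, dest, size):
        """An acknowledgment returns credit for the acknowledged bytes."""
        self.in_flight[dest] = max(0, self.in_flight.get(dest, 0) - size)


# Hypothetical figures: 64 KiB of receive space shared by 16 potential senders.
credit = per_source_credit(64 * 1024, 16)
sender = CreditSender(credit)
```

Under these assumed figures each sender receives a credit of 4096 bytes. If only two of the sixteen potential senders are active, at most one eighth of the receive space can be occupied, which illustrates the utilization drawback noted above.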
In rendezvous-based systems, a sending side first sends the receiving side an envelope of the message to be sent. The envelope contains information regarding the message. The receiving side sends a message back to the sending side when the receiving side is ready to receive and has reserved buffer space for the expected message. As such, an extra round trip is incurred for each message transmitted and extra state information is kept on both sides, impacting latency and overall performance of the system.
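The exchange described above may be sketched as follows, for illustration only. The names `RendezvousReceiver` and `rendezvous_send` are hypothetical, the envelope is reduced to a message identifier and a size, and a return value of `False` stands in for the receiving side withholding its ready reply.

```python
class RendezvousReceiver:
    """Receiving side: reserves buffer space per envelope before data is sent."""

    def __init__(self, capacity):
        self.free = capacity      # unreserved bytes in the receive buffer
        self.reserved = {}        # message id -> bytes reserved for it
        self.delivered = {}       # message id -> payload awaiting processing

    def on_envelope(self, msg_id, size):
        """Return True (ready to receive) only if space could be reserved."""
        if size > self.free:
            return False          # assumption: the ready reply is withheld for now
        self.free -= size
        self.reserved[msg_id] = size
        return True

    def on_data(self, msg_id, payload):
        """Accept the payload into the space reserved for it."""
        if self.reserved.get(msg_id) != len(payload):
            raise ValueError("no matching reservation")
        self.delivered[msg_id] = payload

    def consume(self, msg_id):
        """The application processes the message; its space is released."""
        self.free += self.reserved.pop(msg_id)
        return self.delivered.pop(msg_id)


def rendezvous_send(receiver, msg_id, payload):
    """Sending side: envelope first, payload only after the ready reply."""
    if not receiver.on_envelope(msg_id, len(payload)):
        return False              # sender keeps the message and its state
    receiver.on_data(msg_id, payload)
    return True
```

In this sketch every message costs two calls into the receiver (envelope, then data), mirroring the extra round trip, and both sides hold per-message state until the message is consumed.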
In an application-level acknowledgment protocol, an application process drops an incoming message and sends a negative acknowledgment to the sending side when the receive buffer space is all in use (i.e., full). The sending side takes appropriate recovery actions based on the negative acknowledgment and resumes transmission upon a notification from the destination side that it is safe to resume sending. This scheme requires the application to be involved in the flow control and buffer management of the receive space, and also requires that messages be held for a longer time by the application on the send side for potential retransmission later.
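For illustration, the negative-acknowledgment exchange may be sketched as below. The class names, the string tokens `"ACK"` and `"NACK"`, and the choice of pausing as the recovery action are assumptions of this example; an actual protocol may define its recovery actions differently.

```python
from collections import deque


class NackReceiver:
    """Application-level receiver with a fixed number of buffer slots."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def on_message(self, msg):
        if len(self.buffer) >= self.slots:
            return "NACK"             # buffer full: the message is dropped
        self.buffer.append(msg)
        return "ACK"

    def process_one(self):
        """Process the oldest message, freeing one slot."""
        return self.buffer.popleft()


class NackSender:
    """Holds each message until it is acknowledged; pauses after a NACK."""

    def __init__(self):
        self.held = deque()           # messages kept for possible retransmission
        self.paused = False

    def send(self, msg, receiver):
        self.held.append(msg)
        self._flush(receiver)

    def on_resume(self, receiver):
        """Called when the destination signals that sending may resume."""
        self.paused = False
        self._flush(receiver)

    def _flush(self, receiver):
        while self.held and not self.paused:
            if receiver.on_message(self.held[0]) == "ACK":
                self.held.popleft()
            else:
                self.paused = True    # assumed recovery action: stop and wait
```

The `held` queue makes the cost on the send side visible: a message remains there from the first transmission attempt until the receiver finally accepts it.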
In a timeout-based system, the transport layer on the receiving side discards a packet if the destination application process neither processes nor acknowledges the message within a timing threshold. The sending side continues retransmitting the packet until an acknowledgment is received.
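The receiving-side behavior may be sketched as follows, for illustration only. The class name `TimeoutTransport` and its methods are hypothetical, time is passed in explicitly as a number so that the example is deterministic, and a return value of `True` from `app_consume` stands in for the acknowledgment.

```python
class TimeoutTransport:
    """Receiving-side transport that discards packets left unprocessed too long."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = {}             # packet id -> arrival time

    def on_packet(self, pkt_id, now):
        """Record an arriving packet; a duplicate keeps its original arrival time."""
        self.pending.setdefault(pkt_id, now)

    def expire(self, now):
        """Discard packets that waited longer than the threshold; return their ids."""
        stale = [p for p, t in self.pending.items() if now - t > self.threshold]
        for p in stale:
            del self.pending[p]
        return stale

    def app_consume(self, pkt_id):
        """The application processes the packet; True stands for an acknowledgment."""
        return self.pending.pop(pkt_id, None) is not None


# Hypothetical timeline: the application is too slow for the first copy.
transport = TimeoutTransport(threshold=5)
transport.on_packet(1, now=0)
discarded = transport.expire(now=6)          # first copy is discarded
first_try = transport.app_consume(1)         # nothing to acknowledge
transport.on_packet(1, now=7)                # sender retransmits
second_try = transport.app_consume(1)        # processed and acknowledged
```

In this timeline the first copy of the packet is discarded before the application reaches it, so the sender, having received no acknowledgment, transmits it a second time.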