In many multiple-processor computer systems it is important for processes or tasks running on one computer node (sometimes called the sender or sending computer) to be able to transmit a message or data to another computer node (sometimes called the receiver or receiving computer). A necessary aspect of such message passing is the allocation of buffers in the receiving computer's memory and the establishment of memory management and message transport mechanisms to enable the sending computer to remotely write the contents of a message into the memory of the receiving computer.
While some prior art systems use a "streaming type messaging environment" (in which space for storing received messages is allocated on the fly, as messages are received), the present invention is relevant to distributed computer systems using a "shared memory messaging environment" in which memory for storing messages is allocated in advance, assigned global addresses and exported to another computer node.
Most prior art systems use one of two models for setting up message receive buffers. In the first model, the receiving computer sets up a number of message receive buffers in advance, each buffer having an associated fixed size, and then tells the sending node the location and size of each of those buffers. Each message receive buffer is used just once by the sending computer. When the sending computer needs additional buffers, it requests them from the receiving computer, or the receiving computer automatically allocates new buffers for the sending computer based on usage of the previously allocated buffers.
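The first model can be illustrated with a minimal sketch. The class names, the base address and the buffer sizes below are hypothetical and are chosen only for illustration; they do not describe any particular prior art implementation.

```python
from dataclasses import dataclass


@dataclass
class ReceiveBuffer:
    global_address: int  # hypothetical global address assigned by the receiver
    size: int            # fixed size of this buffer, in bytes


class PreallocatedReceiver:
    """Receiving node that sets up fixed-size receive buffers in advance."""

    def __init__(self, sizes, base_address=0x1000):
        # base_address is an assumed value; buffers are laid out back to back.
        self.buffers = []
        address = base_address
        for size in sizes:
            self.buffers.append(ReceiveBuffer(address, size))
            address += size

    def export(self):
        # Tell the sending node the location and size of each buffer.
        return [(b.global_address, b.size) for b in self.buffers]


class Sender:
    """Sending node that uses each exported buffer just once."""

    def __init__(self, exported):
        self.unused = list(exported)

    def pick_buffer(self, message_len):
        # Take the first unused buffer large enough for the message.
        for index, (address, size) in enumerate(self.unused):
            if size >= message_len:
                return self.unused.pop(index)
        # No suitable buffer left: more must be requested from the receiver.
        return None
```

In this sketch a buffer is removed from the sender's list as soon as it is chosen, which mirrors the single-use behavior described above. A return value of `None` corresponds to the case in which the sending computer must ask the receiving computer for additional buffers.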
In the second model, each time the sending computer wants to send a message it sends a buffer allocation request to the receiving computer, which then allocates a buffer of the requested size and then sends a memory export message to the sending computer to inform it of the allocated buffer's location and associated "global address" range.
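The exchange used in the second model can be sketched as follows. This is only an illustrative sketch: the method names, the dictionary used as the memory export message and the starting global address are assumptions, and the network transport is replaced by direct method calls.

```python
class OnDemandReceiver:
    """Receiving node that allocates one buffer per allocation request."""

    def __init__(self, base_address=0x10000):
        # base_address is an assumed starting point for global addresses.
        self.next_address = base_address
        self.memory = {}  # global address -> allocated buffer

    def handle_alloc_request(self, size):
        # Allocate a buffer of the requested size and build the
        # "memory export message" giving its location and size.
        address = self.next_address
        self.memory[address] = bytearray(size)
        self.next_address += size
        return {"global_address": address, "size": size}

    def remote_write(self, global_address, payload):
        # Stand-in for the sender remotely writing into the exported buffer.
        buffer = self.memory[global_address]
        if len(payload) > len(buffer):
            raise ValueError("payload does not fit in the exported buffer")
        buffer[:len(payload)] = payload


def send_message(receiver, payload):
    # Step 1: buffer allocation request, answered by a memory export message.
    export = receiver.handle_alloc_request(len(payload))
    # Step 2: only after the export arrives can the message be written.
    receiver.remote_write(export["global_address"], payload)
    return export
```

The sketch makes the ordering constraint visible: the message cannot be transmitted until the allocation request has been answered, so every message pays for at least one additional request and reply before any data moves.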
Generally, receive buffers cannot be reused by the sending node because the sending node does not know when the receiving node has finished processing the data in them. Typically, the sending node only receives acknowledgments of the successful receipt of each message, and thus the sending node has no basis for determining when a previously used receive buffer is available for re-use. As a result, each receive buffer is typically deallocated by the receiving node after the receiving node is finished processing the data in the buffer and a new buffer is allocated when the sending node needs one. The new buffer may, or may not, be in the exact same memory location as a previous buffer, but all the overhead of allocating the buffer and setting up the memory management unit table entries in both the receiving and sending nodes is incurred for each buffer allocation.
Also, allocation of receive buffers in advance is of limited utility because messages come in a virtually unlimited range of sizes. As a result, even in systems that set up some receive buffers in advance, messages requiring non-standard buffer sizes use the second model described above for allocating a receive buffer of the needed size.
An advantage of the prior art techniques described above, especially the second model, is that they make efficient use of memory in the receiving node, in that little memory is tied up in receive buffers that may be used seldom or never. Also, in systems with light message traffic, the CPU and communications overhead of setting up and tearing down receive buffers is relatively light. However, the system latencies caused by having to wait for a receive buffer to be requested and allocated before transmission of the message can be substantial, and those system latencies can indirectly result in degradation of system performance, especially in multiprocessor systems in which tasks are distributed over multiple processors and message traffic between the processors is heavy and forms an integral part of the data processing being performed.
In summary, there is a need for more efficient receive buffer allocation methodologies, especially in multiprocessor systems with heavy message traffic.
Referring to FIG. 1, there is shown a highly simplified representation of two computer nodes, herein called Node A 50 and Node B 52. The computer at each node can be any type of computer. In other words, the particular brand, architecture and operating system are of no importance to the present discussion, so long as each computer node is configured to operate in a networked environment. Each computer node 50, 52 will typically include a central processing unit (CPU) 54, random access memory 56, an internal memory bus 58 and a communications interface 60, often called a network interface card (NIC). The computer nodes communicate with each other by transmitting messages or packets to each other via a network interconnect 62, which may include one or more types of communication media, switching mechanisms and the like.
Each computer node 50, 52 typically also has a non-volatile, non-random access memory device 64, such as a high speed magnetic disk, and a corresponding disk controller 66.
FIG. 2 shows a simplified representation of a conventional communications interface (or NIC) 60, such as the ones used in the computer nodes of FIG. 1, showing only the components of particular interest. The NIC 60 typically includes two address mapping mechanisms: an incoming memory management unit (IMMU) 70 and an outgoing memory management unit (OMMU) 72. The purpose of the two memory management units is to map local physical addresses (PA's) in each computer node to global addresses (GA's) and back. Transport logic 74 in the NIC 60 handles the mechanics of transmitting and receiving message packets, including looking up and converting addresses using the IMMU 70 and OMMU 72.
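The address mapping role of the two memory management units can be illustrated with a simple page-table sketch. The page size, the table layout and the assignment of directions (the OMMU translating a sender's local physical address to a global address, and the IMMU translating a global address to a receiver's local physical address) are assumptions made for illustration and are not taken from any particular NIC.

```python
PAGE_SIZE = 4096  # assumed page size, in bytes


class TranslationTable:
    """Page-granular address map, used here for both the OMMU and the IMMU."""

    def __init__(self):
        self.entries = {}  # source page number -> destination page number

    def map_page(self, source_page, destination_page):
        # Store an address translation entry (done by the NIC driver).
        self.entries[source_page] = destination_page

    def unmap_page(self, source_page):
        # Delete an address translation entry.
        del self.entries[source_page]

    def translate(self, address):
        # Split the address into page number and offset, translate the
        # page number, and keep the offset unchanged.
        page, offset = divmod(address, PAGE_SIZE)
        if page not in self.entries:
            raise KeyError(f"no translation entry for page {page:#x}")
        return self.entries[page] * PAGE_SIZE + offset


# Sending node: local physical page 0x20 is mapped to global page 0x9000.
ommu = TranslationTable()
ommu.map_page(0x20, 0x9000)

# Receiving node: global page 0x9000 is mapped to local physical page 0x55.
immu = TranslationTable()
immu.map_page(0x9000, 0x55)

local_pa = 0x20 * PAGE_SIZE + 0x10
global_address = ommu.translate(local_pa)        # PA -> GA at the sender
remote_pa = immu.translate(global_address)       # GA -> PA at the receiver
```

Under these assumptions, a write issued by the sender to `local_pa` lands at `remote_pa` in the receiver's memory, with the byte offset within the page preserved at each step.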
The dashed lines between the memory bus 58 and the IMMU 70 and OMMU 72 represent CPU derived control signals for storing and deleting address translation entries in the two MMU's, typically under the control of a NIC driver program. The dashed line between the memory bus 58 and the transport logic 74 represents CPU derived control signals for configuring and controlling the transport logic 74.