Traditional message passing in a memory mapped computer network environment employs programmed I/O and DMA (direct memory access) operations. The present invention unifies these two methods to achieve the low CPU overhead of DMA for both programmed I/O and DMA, using a "write only" model of communication that is "ordering barrier" free. Richer and more reliable communication primitives, such as RPC, RMI and the like, can be built on top of the communication mechanisms of the present invention.
The present invention is directed primarily at message passing between the nodes of a cluster, where the term "cluster" means a set of computer system nodes that are interconnected by an interconnection fabric that exhibits the properties of a highly reliable, very low bit error rate, memory system interconnect. Usually, all the nodes in a cluster are located in the same room or building. However, the present invention may find wider use as the reliability of communications between more distant nodes improves.
Referring to FIG. 1, the context of the present invention is a distributed computer system 100 in which two or more computer system nodes 102 are interconnected by a communication network 104. Each computer system node 102 includes a network interface card (NIC) 106, and one or more distinct CPUs 108. The particular implementation of the computers at each node is irrelevant to the present invention, so long as the computer (or computers) at each node is capable of performing both programmed I/O and DMA operations and includes a network interface 106. The nodes 102 of the system can include single processor nodes, parallel processor nodes and symmetric multiprocessor (SMP) nodes.
For the purposes of this document, the term "programmed I/O" or "PIO" is defined to mean a data transfer from a first, local memory mapped address location in a first node to a second, remote memory mapped location, called the destination location, in a second node. PIO is typically accomplished by executing a load instruction to load data from the first memory mapped location into a local register, and then executing a store instruction to store the data from the local register to the second memory mapped location. If the data is already available in a local register as the result of a prior computation, only the store instruction is needed to transfer it to the destination location. In the present invention, the term "PIO" refers to both the one and two CPU instruction versions, depending on the computational context in which the PIO operation is performed. Similarly, the terms "PIO instruction" and "PIO command" can mean one or two CPU instructions, depending on the context.
Thus programmed input/output (PIO) is a method of data transfer in which the CPU 108 of the sending device literally executes one or two instructions for directly controlling the transfer of each "data chunk" from a local memory location to a remote memory location. When PIO is used in a memory mapped network, a PIO store instruction executed in one computer can directly write data into the memory of another computer. This "remote write" capability, which is well known to those skilled in the art, will be explained in more detail below with respect to FIGS. 2, 3 and 4.
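The load and store sequence described above can be illustrated with a minimal sketch in C. The function names pio_copy and pio_store are hypothetical, and a 64-bit word is used as a stand-in for the "data chunk"; the actual chunk size and the instructions used depend on the CPU and interconnect. In a real system the destination pointer would refer to a memory mapped region exported by the remote node, whereas here it is simply a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the source): two-instruction form of PIO.
 * Each iteration performs one load from local memory into a register and
 * one store from that register to the memory mapped destination.
 * The 64-bit word size is an assumption made for illustration. */
static void pio_copy(volatile uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint64_t reg = src[i]; /* load: local memory -> register */
        dst[i] = reg;          /* store: register -> mapped destination */
    }
}

/* Hypothetical helper: one-instruction form of PIO. The value is already
 * in a register as the result of a prior computation, so only the store
 * is needed. */
static void pio_store(volatile uint64_t *dst, uint64_t value)
{
    *dst = value;
}
```

The volatile qualifier is used so that the compiler issues every store to the destination rather than combining or eliding them, which is the usual practice when writing to memory mapped locations.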
Typically, a "data chunk" is the amount of data that can be transferred over the network 104 as a single atomic action, from the viewpoint of the CPU. If the atomic unit of data transfer is 64 bytes, then transmitting a message 128 bytes in length would take two PIO commands and transmitting any message whose length is 64 bytes or less would take only one PIO command.
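The relationship between message length and the number of PIO commands is a ceiling division, which can be sketched as follows. The function name pio_commands_needed and the constant ATOMIC_UNIT are hypothetical; the 64 byte value is taken from the example above and is not a fixed property of the invention.

```c
#include <stddef.h>

/* Assumed atomic unit of data transfer, in bytes (from the example). */
#define ATOMIC_UNIT 64u

/* Hypothetical helper: number of PIO commands needed to send msg_len
 * bytes when each command moves at most `unit` bytes. The result is
 * msg_len divided by unit, rounded up. */
static size_t pio_commands_needed(size_t msg_len, size_t unit)
{
    return (msg_len + unit - 1) / unit;
}
```

With a 64 byte unit, a 128 byte message yields two commands, and any message of 1 to 64 bytes yields one.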
PIO is well known to be efficient for the transmission of short messages, in part because the message can be written directly to any memory location in the destination node that has been exported to the sending node. The destination location for a PIO operation does not need to be page aligned, and thus it is often unnecessary for the receiving system to copy the received message before processing it. This avoidance of making a local destination copy helps make the use of PIO very efficient.
However, PIO is inefficient for the transmission of long messages. For instance, transmitting a "one page" message of 8K (8192) bytes in a system with a 64 byte atomic unit of data transfer would require 128 PIO commands (i.e., 128 pairs of CPU load and store instructions) to read data from a source location and to write the data to the memory mapped destination location. Thus, transferring even a single memory page using PIO ties up the sending system's CPU for hundreds of CPU cycles. Generally, most computer systems handle "long" data transfers (typically more than the amount of data that could be transferred with 20 or so PIO commands) through the use of DMA operations.
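The conventional choice between PIO and DMA described above can be sketched as a simple threshold test. The names xfer_method and choose_method are hypothetical, and the threshold of 20 PIO commands is only the approximate figure mentioned in the text; a real system would tune it to its own hardware.

```c
#include <stddef.h>

/* Hypothetical enumeration of the two conventional transfer methods. */
enum xfer_method { XFER_PIO, XFER_DMA };

/* Hypothetical helper: pick PIO for short messages and DMA for long ones.
 * `unit` is the atomic unit of data transfer in bytes, and `max_pio_cmds`
 * is the largest number of PIO commands considered worthwhile (about 20
 * in the text; this value is an assumption, not a fixed rule). */
static enum xfer_method choose_method(size_t msg_len, size_t unit,
                                      size_t max_pio_cmds)
{
    size_t cmds = (msg_len + unit - 1) / unit; /* ceiling division */
    return cmds > max_pio_cmds ? XFER_DMA : XFER_PIO;
}
```

For an 8K (8192 byte) page and a 64 byte unit, the command count is 8192 / 64 = 128, which exceeds the threshold, so the sketch selects DMA. A 128 byte message needs only two commands and stays with PIO.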
DMA operations are performed by a hardware assisted data transfer mechanism (herein called DMA logic) in which "control descriptors" or "control blocks" are first established in both the sending and receiving system nodes. The control descriptors define the starting address of the data source, the starting address of the data destination, the amount of data to be transferred, and various control flags for invoking interrupts, acknowledgment signal mechanisms and the like at the conclusion of the data transfer. Generally, both the source and destination locations of a DMA operation must be page aligned. After the control descriptors have been established, which can take anywhere from ten to a hundred or so CPU instructions depending on the implementation, the data transfer is handled entirely by the DMA logic, freeing the CPU to perform other operations while the DMA logic handles the data transfer.
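One possible layout for such a control descriptor is sketched below. The structure dma_descriptor, its field names, the flag values and the 8192 byte page size are all assumptions made for illustration; actual descriptor formats are defined by the particular DMA logic in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_PAGE_SIZE      8192u /* assumed page size in bytes */
#define DMA_FLAG_INTERRUPT 0x1u  /* hypothetical: interrupt on completion */
#define DMA_FLAG_ACK       0x2u  /* hypothetical: send acknowledgment */

/* Hypothetical control descriptor holding the items listed in the text. */
struct dma_descriptor {
    uint64_t src_addr; /* starting address of the data source */
    uint64_t dst_addr; /* starting address of the data destination */
    uint32_t length;   /* amount of data to transfer, in bytes */
    uint32_t flags;    /* completion behavior (interrupt, ack, ...) */
};

/* Hypothetical helper: fill in a descriptor, rejecting source or
 * destination addresses that are not page aligned. Returns true on
 * success; on failure the descriptor is left unmodified. */
static bool dma_descriptor_init(struct dma_descriptor *d, uint64_t src,
                                uint64_t dst, uint32_t length,
                                uint32_t flags)
{
    if (src % DMA_PAGE_SIZE != 0 || dst % DMA_PAGE_SIZE != 0)
        return false;
    d->src_addr = src;
    d->dst_addr = dst;
    d->length = length;
    d->flags = flags;
    return true;
}
```

The alignment check reflects the general requirement stated above that both ends of a DMA operation be page aligned, which is the constraint that makes DMA unsuitable for short headers destined for arbitrary locations.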
DMA is efficient because it does not burden the processor. However, DMA cannot be used to transmit short message headers to locations that are not page aligned. Thus, when a "multipart message" is transmitted to a destination location using DMA it is often necessary for the receiving system to copy the various components of the message to their respective "real" destination locations, some of which are page aligned and some of which are not page aligned. This receive side copying requirement can substantially reduce the efficiency of using DMA to transfer messages between computer nodes.
For the purposes of this document, the term "multipart message" is defined to mean a message having at least two distinct components, each of which must be transmitted to a different respective destination memory location. Typically, at least two of the respective destination memory locations are not contiguous with each other since one of the respective destination memory locations is usually a page aligned receive buffer and another is usually a slot in a queue or array data structure.
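The definition above can be made concrete with a small sketch that represents each component of a multipart message by its destination and length, and tests whether two components land in contiguous memory. The structure msg_part and the function parts_contiguous are hypothetical names introduced only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one component of a multipart message. */
struct msg_part {
    uint64_t dst_addr; /* destination address in the receiving node */
    size_t   len;      /* length of this component in bytes */
};

/* Hypothetical helper: true if part b begins exactly where part a ends,
 * i.e. the two destinations are contiguous in that order. */
static bool parts_contiguous(const struct msg_part *a,
                             const struct msg_part *b)
{
    return a->dst_addr + a->len == b->dst_addr;
}
```

In the typical case described above, a payload written to a page aligned receive buffer and a header written to a slot in a queue would not satisfy this test, so they cannot be delivered as one contiguous transfer.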
FIG. 2 shows a simplified representation of a conventional communications interface (or NIC) 106, such as the ones used in the computer nodes 102 of FIG. 1, showing only the components of particular interest. The NIC 106 typically includes two address mapping mechanisms: an incoming memory management unit (IMMU) 121 and an outgoing memory management unit (OMMU) 122. The purpose of the two memory management units is to map local physical addresses (PA's) in each computer node to global addresses (GA's) and back. Transport logic 124 in the NIC 106 handles the mechanics of transmitting and receiving message packets, including looking up and converting addresses using the IMMU 121 and OMMU 122.
The dashed lines between the memory bus 110 and the IMMU 121 and OMMU 122 represent CPU derived control signals for storing and deleting address translation entries in the two memory management units (MMU's), typically under the control of a NIC driver program. The dashed line between the memory bus 110 and the transport logic 124 represents CPU derived control signals for configuring and controlling the transport logic 124.
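The address mapping performed by the two memory management units can be sketched as a page granular table lookup. This sketch assumes that the OMMU translates local physical addresses to global addresses on the sending side and that the IMMU translates global addresses back to local physical addresses on the receiving side. The structure map_entry, the function translate, the linear search and the 8192 byte page size are all hypothetical; real hardware would use its own table format and lookup mechanism.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_PAGE_SIZE 8192u /* assumed mapping granularity in bytes */

/* Hypothetical translation entry: maps one page number to another.
 * In an OMMU table, from_page is a local physical page and to_page is a
 * global page; in an IMMU table the roles are reversed. */
struct map_entry {
    uint64_t from_page;
    uint64_t to_page;
    bool     valid;
};

/* Hypothetical helper: translate `addr` using `table`, preserving the
 * offset within the page. Returns false if no valid entry covers the
 * address, in which case *out is left unmodified. */
static bool translate(const struct map_entry *table, size_t n,
                      uint64_t addr, uint64_t *out)
{
    uint64_t page = addr / MAP_PAGE_SIZE;
    uint64_t offset = addr % MAP_PAGE_SIZE;
    for (size_t i = 0; i < n; i++) {
        if (table[i].valid && table[i].from_page == page) {
            *out = table[i].to_page * MAP_PAGE_SIZE + offset;
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a remote write passes through the sender's OMMU table to obtain a global address, and the receiver's IMMU table then converts that global address into a local physical address. Storing and deleting entries in such tables corresponds to the driver controlled operations represented by the dashed lines in FIG. 2.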