The invention relates broadly to message passing in parallel computer systems using direct memory access (DMA) engines, and more particularly relates to a DMA engine for implementing a rendezvous protocol through use of a limited number of DMA reception and injection byte counters to support a large number of outstanding messages by sharing, a method for message passing in a parallel computer system including DMAs at local compute nodes comprising the system constructed with a limited number of injection and reception byte counters, and parallel computer system including such DMAs or DMA engine.
Parallel computer systems include multiple processors, or multiple compute nodes comprising processors, which run threads or local instances of global applications to accomplish tasks or processes, e.g., BlueGene/P ultrascalable Petaflop Parallel Supercomputer, by IBM Corporation. The local application or thread carries out the message passing to complete designated, node-specific portions of the task or process. At the compute nodes, “events” are generated each time a message is received, and are processed according to compute node's inherent event processing algorithm. To ensure proper communication between various compute nodes comprising a parallel computer system, a standard known as message passing interface (MPI) has developed. MPI is the industry-standard message passing interface.
An MPI program consists of autonomous processes, executing their own code, which need not be identical. Typically, each process or application communicates via calls to MPI communication primitives, where each process executes in its own and shared memory. Such message passing allows the local processors comprising the compute node, and applications running thereon (a thread or instance of the global application or process) to cooperate with each other. Generally speaking, an MPI is an interface designed to allow a user to code data such that the local processors at the compute nodes comprising the network are able to send and receive data and information to coordinate the disposition of the global application or process. MPI is available on a wide variety of platforms ranging from massively parallel systems (IBM, Cray, Intel Paragon, etc.) to networks of workstations.
The use of Direct Memory Address (DMA) technology provides for reducing CPU (processor) workload in the management of memory operations required for messaging in any computer system, and are particularly relied on in large parallel computer systems. DMA, or DMA engines, work in conjunction with the local application or thread implementing the MPI application. Workload that would normally have to be processed by a CPU at a compute node is instead handled by the DMA engine. The use of DMA technology in large parallel computer systems is limited somewhat by such system's inherent the need for tight hardware control and coordination of memory and message-passing operations. That is, required tight coupling between memory operations and CPU operations poses some challenges, including the need for a sending side compute node (a source compute node originating a message to be passed to another or multiple other compute nodes) to have awareness of the receiver node's remote address spaces, multiple protection domains, locked down memory requirements (also called pinning), notification, striping, recovery models, etc.
In parallel computer, such as the BlueGene/P, a “Rendezvous” protocol is often used to send long messages between compute nodes. Following the Rendezvous protocol, a source compute node (a thread or instance of the global application running on the sending side) sends a long message by first passing a request to send (RTS) packet to the target compute node. The RTS contains information or data identifying the source compute node and the message being. sent, e.g., number of total bytes. The target compute node replies to the RTS by generating and sending a “CTS (clear to send)” packet, assuming the target compute node is able to receive. The CTS includes information or data describing the target compute node (destination side) in order to receive the entire message. Finally, the source compute node sends self-describing “data” packets to the Target node, which can arrive in any order. The packet transfer continues until the entire message has been sent. RTS packet transfers, or message transfers following the Rendezvous protocol, are ordered with respect to the transfer of other messages out of a compute node, or into a compute node, e.g., with respect to other rendezvous or eager messages.
As mentioned above, computer systems and in particular parallel computer systems such as BlueGene/P utilize DMA engines to asynchronously move data (e.g., message passing) between in-node memory and the communications network (other compute nodes). DMA engines operate under a set of constructs used by message passing libraries (as in MPI) to set up and monitor completion of DMA data transfers. In large parallel computer systems such as BlueGene/P, DMAs may be fabricated within or integrated into the same ASIC comprising the node's processors. As such, size is a consideration and therefore such DMAs often have finite resources, for example, byte counters to tracks the number of bytes sent or received in a DMA transfer, which must be managed wisely to maximize the exchange of the many messages (at the compute node comprising the DMA channel). In peak performance applications, many messages might be regularly outstanding that must be managed by the DMA engine before data transfers.
Message passing libraries used for DMA message transfer in parallel computer systems inefficiently implement known rendezvous protocol for some applications. Before the instant invention, it has been unknown for conventional parallel computer systems to operate with DMA engines that have inherently limited numbers of byte counters and other registers because of size constraints. Until the recent development of IBM's BlueGene/P ultrascalable Petaflop Parallel Supercomputer, which includes compute nodes with DMA engines integrated within a single ASIC core (and therefore a limited number of byte counters), efficiency utilizing such a limited number of byte counters was not a priority for computer designers. Hence, other versions of the rendezvous protocol to provide for a DMA engine's efficient management of a limited number of counters are unknown, and would be desirable for use in a supercomputer such as BlueGene/P.
What would be desirable in the field of parallel computer systems and their design, and in particular in parallel computer systems including DMA engines constructed to operate with limited byte counter resources, is a DMA engine capable of supporting a large number of outstanding messages and detection of message completion in a parallel multi-computer system in which there is a DMA engine.