The invention relates to parallel computer systems, and more particularly relates to a remote messaging engine capable of supporting and sending multiple remote messages to multiple remote nodes of a parallel computer network of interconnected compute nodes, triggered by a single send message from a source node and without need for compute node processor control of the remote messaging. The remote messaging engine for multiple node remote messages, and the novel messaging operation provided thereby, are set forth and described herein for the purpose of conveying the broad inventive concepts. The drawings and descriptions provided are not meant to limit the scope and spirit of the invention in any way.
Parallel computer systems, e.g., the BlueGene/P ultrascalable Petaflop Parallel Supercomputer by IBM Corporation, include multiple compute nodes that each run threads of a global application program to accomplish tasks or processes. The individual compute nodes, and the instances of the global application running at each compute node, carry out message passing to complete designated node-specific portions of the task or process. During message passing, an event is generated each time a message is received at a compute node of the parallel computer system. The compute node (and its local processors) processes such events according to its inherent event-processing algorithm. In general, various devices often use a special type of event processing system for managing various messages. To ensure such communication between the compute nodes of a parallel computer system, a standard known as the message passing interface (MPI), defined by a group of organizations including various vendors and researchers, is used.
MPI is the industry-standard message-passing interface. An MPI program consists of autonomous processes, each executing its own code, which need not be identical. Typically, each process or application communicates via calls to MPI communication primitives, where each process executes in its own address space, although shared-memory implementations are possible. Such message passing allows the local processors of the compute node, and the applications running thereon (a thread or instance of the global application or process), to cooperate with each other. Generally speaking, MPI is an interface designed to allow a user to code data such that the local processors at the compute nodes of the network are able to send and receive data and information to coordinate the disposition of the global application or process. MPI is available on a wide variety of platforms ranging from massively parallel systems (IBM, Cray, Intel Paragon, etc.) to networks of workstations.
The use of Direct Memory Access (DMA) technology reduces the CPU (processor) workload in managing the memory operations required for messaging in any computer system, and is particularly relied on in large parallel computer systems. DMA, or DMA engines, work in conjunction with the local application or thread implementing the MPI application. Workload that would normally have to be processed by a CPU at a compute node is instead handled by the DMA engine. The use of DMA technology in large parallel computer systems is limited somewhat by such systems' inherent need for tight hardware control and coordination of memory and message-passing operations. That is, the required tight coupling between memory operations and CPU operations poses some challenges, including the need for a sending side compute node (a source compute node originating a message to be passed to one or more other compute nodes) to have awareness of the receiver node's remote address spaces, multiple protection domains, locked down memory requirements (also called pinning), notification, striping, recovery models, etc.
In a parallel computer such as the BlueGene/P, a “Rendezvous” protocol is often used to send long messages between compute nodes. Following the Rendezvous protocol, a source compute node (a thread or instance of the global application running on the sending side) sends a long message by first passing a request to send (RTS) packet to the target compute node. The RTS contains information or data identifying the source compute node and the message being sent, e.g., the total number of bytes. Assuming the target compute node is able to receive, it replies to the RTS by generating and sending a clear to send (CTS) packet. The CTS includes the information or data describing the target compute node (destination side) that the source needs in order for the target to receive the entire message. Finally, the source compute node sends self-describing “data” packets to the target compute node, which can arrive in any order. The packet transfer continues until the entire message has been sent. RTS packet transfers, or message transfers following the Rendezvous protocol, are ordered with respect to the transfer of other messages out of a compute node, or into a compute node, e.g., with respect to other rendezvous or eager messages.
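The RTS, CTS, and data-packet exchange described above can be illustrated with a minimal sketch. This is a software simulation only, not the BlueGene/P implementation; the class names, the field names (`msg_id`, `buffer_id`, `offset`), and the `rendezvous_send` helper are all hypothetical and chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class RTS:
    source: int       # id of the sending compute node
    msg_id: int       # identifies the message being sent
    total_bytes: int  # total length of the message


@dataclass
class CTS:
    target: int       # id of the receiving compute node
    msg_id: int
    buffer_id: int    # hypothetical handle describing the receive buffer


@dataclass
class DataPacket:
    msg_id: int
    offset: int       # self-describing: where the payload belongs
    payload: bytes


class TargetNode:
    """Hypothetical target node that accepts every request to send."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.buffers = {}

    def on_rts(self, rts):
        # Reserve room for the entire message, then reply clear-to-send.
        self.buffers[rts.msg_id] = bytearray(rts.total_bytes)
        return CTS(self.node_id, rts.msg_id, buffer_id=rts.msg_id)

    def on_data(self, pkt):
        # Packets carry their own offset, so arrival order does not matter.
        buf = self.buffers[pkt.msg_id]
        buf[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload


def rendezvous_send(message, target, source_id=0, msg_id=1,
                    packet_size=4, seed=0):
    # Step 1: the source sends an RTS and receives the CTS in reply.
    cts = target.on_rts(RTS(source_id, msg_id, len(message)))
    # Step 2: the source splits the message into self-describing packets.
    packets = [
        DataPacket(cts.msg_id, off, message[off:off + packet_size])
        for off in range(0, len(message), packet_size)
    ]
    # Step 3: shuffle to mimic out-of-order arrival over the network.
    random.Random(seed).shuffle(packets)
    for pkt in packets:
        target.on_data(pkt)
    return bytes(target.buffers[msg_id])
```

Because each data packet names its own offset, the target can reassemble the message regardless of the order in which the packets arrive.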
As mentioned above, computer systems, and in particular parallel computer systems such as BlueGene/P, utilize DMA engines to asynchronously move data (e.g., message passing) between in-node memory and the communications network (other compute nodes). DMA engines operate under a set of constructs used by message passing libraries (as in MPI) to set up and monitor completion of DMA data transfers. In large parallel computer systems such as BlueGene/P, DMAs may be fabricated within or integrated into the same ASIC that comprises the node's processors. As such, size is a consideration, and such DMAs therefore often have finite resources, for example, byte counters to track the number of bytes sent or received in a DMA transfer, which must be managed wisely to maximize the exchange of the many messages (at the compute node comprising the DMA channel). In peak performance applications, many messages might regularly be outstanding that must be managed by the DMA engine before data transfers.
Message passing libraries used for DMA message transfer in parallel computer systems implement the known rendezvous protocol inefficiently for some applications. Before the instant invention, it has been unknown for conventional parallel computer systems to operate with DMA engines that have inherently limited numbers of byte counters and other registers because of size constraints. Until the recent development of IBM's BlueGene/P ultrascalable Petaflop Parallel Supercomputer, which includes compute nodes with DMA engines integrated within a single ASIC core (and therefore a limited number of byte counters), efficiently utilizing such a limited number of byte counters was not a priority for computer designers. Hence, other versions of the rendezvous protocol that provide for a DMA engine's efficient management of a limited number of counters are unknown, and would be desirable for use in a supercomputer such as BlueGene/P.
For that matter, commonly-owned co-pending U.S. patent application Ser. No. (YOR820070343), entitled: DMA Shared Byte Counter In A Parallel Computer, filed concurrently and incorporated by reference herein, discloses a DMA engine for use in a parallel computer system, a method for passing messages using such a DMA engine in a parallel computer system, and a parallel computer system utilizing the novel DMA engine for sharing of byte counters by multiple messages. With the aforementioned DMA Shared Byte Counter, however, the application thread running at a compute node and using the local DMA engine for message passing is unable to determine whether a message that is using a shared byte counter has completed, other than at the time when it can be determined that all of the messages (message packet transfers) sharing the shared byte counter have completed.
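The limitation of a shared byte counter can be seen in a small sketch. The model below is an assumption made for illustration, not the design disclosed in the referenced application: a single counter holds the bytes still expected across all messages that share it, and the class and function names are hypothetical.

```python
class SharedByteCounter:
    """Hypothetical model of one byte counter shared by several messages."""

    def __init__(self):
        self.remaining = 0  # bytes still expected across ALL sharing messages

    def add_message(self, nbytes):
        # Each new message adds its length to the shared total.
        self.remaining += nbytes

    def on_packet(self, nbytes):
        # Each delivered packet subtracts its payload size.
        self.remaining -= nbytes

    def all_complete(self):
        # Zero means every message sharing the counter has finished.
        return self.remaining == 0


def counter_after(deliveries, sizes):
    """Replay (message_name, nbytes) deliveries against one shared counter."""
    counter = SharedByteCounter()
    for nbytes in sizes.values():
        counter.add_message(nbytes)
    for _name, nbytes in deliveries:
        # The counter never sees which message a packet belongs to.
        counter.on_packet(nbytes)
    return counter


sizes = {"A": 8, "B": 4}

# History 1: message A fully delivered, message B not yet started.
a_finished = counter_after([("A", 4), ("A", 4)], sizes)

# History 2: message A half delivered, message B fully delivered.
b_finished = counter_after([("A", 4), ("B", 4)], sizes)

# Both histories leave the same value, so the counter alone cannot
# tell which individual message has completed.
same_value = a_finished.remaining == b_finished.remaining

# Only when every byte has arrived does the counter reach zero.
everything = counter_after([("A", 4), ("A", 4), ("B", 4)], sizes)
```

In this model, two different delivery histories leave the same counter value, which is why completion is only observable once all sharing messages have finished.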
Direct memory access (DMA) allows certain hardware sub-systems within a computer or computer system to access system memory for reading and/or writing independently of the central processing unit, or of multiple central processing units in the case of parallel computers and computer systems. DMA is used by disk drive controllers, graphics cards, network cards, sound cards and like devices. Computer systems that employ DMA channels can transfer data to and from devices with much less CPU overhead than computer systems without a DMA channel.
A DMA transfer comprises copying a block of memory from one device to another within the computer system. The CPU initiates the DMA transfer, but the DMA engine carries out the task. For what is known in the art as “third party” DMA, for example, as used in conjunction with conventional ISA bus operation, a DMA controller or engine that is normally part of the motherboard chipset performs the transfer. By contrast, the BlueGene/P, a parallel multi-computer system by International Business Machines (IBM), includes a DMA engine integrated onto the same chip as the processors (CPUs), cache memory, memory controller and network logic.
DMAs are used conventionally to copy blocks of memory from system RAM to or from a buffer on the DMA device without interrupting the processor, which is quite important to high-performance embedded systems. DMA is also used conventionally to offload expensive memory operations, such as large copies, from the CPU to a dedicated DMA engine. For example, a “scatter gather” DMA allows the transfer of data to and from multiple memory areas in a single DMA transaction. Scatter gather DMA chains together multiple simple DMA requests in order to off-load multiple input/output interrupt and data copy tasks from the processor or CPU.
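The chaining of simple requests in a scatter gather transfer can be sketched in software as a list of descriptors processed in one call. This is an illustrative model only; the descriptor layout `(src_offset, dst_offset, length)` and the function name `scatter_gather_copy` are assumptions, not the format of any particular DMA controller.

```python
def scatter_gather_copy(src, dst, descriptors):
    """Process a chain of (src_offset, dst_offset, length) descriptors.

    Each descriptor stands in for one simple DMA request; the whole
    chain is handled in a single call, mimicking one transaction.
    Returns the total number of bytes moved.
    """
    moved = 0
    for src_off, dst_off, length in descriptors:
        # Reject descriptors that reach outside either memory area.
        if src_off < 0 or dst_off < 0 or length < 0:
            raise ValueError("negative offset or length in descriptor")
        if src_off + length > len(src) or dst_off + length > len(dst):
            raise ValueError("descriptor exceeds buffer bounds")
        dst[dst_off:dst_off + length] = src[src_off:src_off + length]
        moved += length
    return moved


source = bytearray(b"ABCDEFGH")
destination = bytearray(8)

# Two non-adjacent source regions land in two non-adjacent destinations.
chain = [(0, 4, 2), (6, 0, 2)]
total_moved = scatter_gather_copy(source, destination, chain)
```

The caller builds the chain once and issues a single request, which is the software analogue of the processor handing the whole list to the DMA engine rather than servicing each copy itself.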
In a DMA engine for transferring data from a network interface, the CPU provides the destination address for moving data from the network interface to the memory. The lengths of the packets received from the network interface, and their semantics, are not known in advance. Multiple packets can contain various parts of a single data message or transfer, such as in MPI messages. A DMA engine moves received packets to their destination addresses without reordering packets that are received out of order, and stores the packets in a single contiguous address space if they are part of a single message. Some network protocols, such as MPI with the rendezvous protocol, support an acknowledgement mechanism that allows the other party to transfer large messages using a number of packets. In that case, the destination address for the whole data block, in a contiguous address space at the receiver side, is provided by the receiver to the transmitter side ahead of time or at the beginning of transmission.
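As a rough illustration of placing packets into a contiguous region at a destination address supplied in advance, consider the sketch below. It assumes, hypothetically, that each packet carries its byte offset within the message; node memory is modeled as a flat `bytearray`, and `place_packets` is an illustrative name rather than a real DMA interface.

```python
def place_packets(memory, base_addr, packets):
    """Write each (offset, payload) packet at base_addr + offset.

    Packets may have different lengths and may arrive in any order;
    no reordering step is needed because each write is independent.
    Returns the number of bytes written.
    """
    received = 0
    for offset, payload in packets:
        start = base_addr + offset
        end = start + len(payload)
        if start < 0 or end > len(memory):
            raise ValueError("packet would write outside memory")
        memory[start:end] = payload
        received += len(payload)
    return received


node_memory = bytearray(32)
destination_base = 8  # provided by the receiver ahead of time

# Packets of varying length, listed in the order they arrive.
arrivals = [(6, b"world"), (0, b"hello "), (11, b"!")]
bytes_received = place_packets(node_memory, destination_base, arrivals)
message = bytes(node_memory[destination_base:destination_base + bytes_received])
```

Since every packet is written directly to its final location, the message ends up contiguous in memory without any intermediate buffering or sorting.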
What would be desirable in the field of parallel computer systems and their design, and in particular in parallel computer systems including DMA engines, is a DMA engine constructed to perform remote message sends to remote compute nodes of the parallel computer system automatically in hardware, without core processor (e.g., CPU) involvement, triggered by a single message from a source compute node.