This invention relates to the field of message-passing in a distributed-memory parallel computer network for large data processing applications such as for computation in the field of life sciences, and more particularly relates to a DMA engine constructed for handling repeating communication patterns within individual compute nodes of a parallel computer system comprising a plurality of interconnected compute nodes.
A message-passing data network serves to pass messages between compute nodes comprising a distributed-memory parallel computer system, e.g., BlueGene/P ultrascalable Petaflop Parallel Supercomputer, by IBM Corporation. Each compute node comprising such a network or system includes one or more computer processors that run local instances of applications operating globally on local memory at the compute node, and performs local operations independent of the other compute nodes. Compute nodes can act in concert by passing messages between each other over the distributed-memory parallel computer system. The local instances of the applications also use other local devices, such as a DMA network interface, which is described in detail below. The global application operates across the multiple compute nodes comprising the parallel computer system to coordinate global actions and operations across the nodes, including passing messages therebetween.
The hardware comprising each compute node within the parallel computer system includes a DMA network interface. During normal parallel computer system operation, the local instance of the global application running on a local compute node may send a message to another compute node by first injecting the message into its DMA network interface. The DMA network interface forwards the message onto the network, which passes the message to the DMA network interface on the receiving compute node. The message is then received by the local instance of the application at that receiving compute node.
Various network interfaces, as distinguished from DMA network interfaces, are known that accept a description, or message descriptor, of each message to be exchanged or passed within a parallel computer system. Such known compute node network interfaces are described in: Welsh, et al., “Incorporating Memory Management into User-Level Network Interfaces”, TR97-1620, Dept. Computer Science, Cornell Univ., Ithaca, N.Y., 1997 (“the Welsh reference”); Pratt, et al., Arsenic: A user-accessible gigabit ethernet interface; Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM-01), pages 67-76, April 2001 (“the Pratt reference”); and U.S. Pat. No. 5,751,951 to Osborne, et al., issued May 12, 1998 and entitled: Network Interface (“the '951 patent”).
FIG. 2 herein is a schematic diagram depicting a conventional network interface that includes two (or more) Injection FIFOs (10, 20) that are controlled by a message-passing application running on a conventional parallel computer system (not shown). Each Injection FIFO is arranged to provide storage for one or more message descriptors that are injected into the conventional interface from a compute node. Injection FIFO 10 provides storage for 4 message descriptors: 11, 12, 13, 14; and Injection FIFO 20 provides storage for 4 message descriptors: 21, 22, 23, 24.
The message descriptors for the message passing are typically maintained in a fixed predetermined area, which may be in application memory. Each message descriptor typically includes a pointer to the local application data to be sent, the length of the data to be sent and a description of the remote compute node and associated remote memory locations slated to receive the data at the remote compute node. Given a message descriptor, the conventional network interface at the sending compute node sends the corresponding message.
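By way of illustration only, the contents of such a message descriptor may be sketched as follows. The field names, the use of byte offsets in place of hardware pointers, and the `send` helper are assumptions made for this sketch; they are not taken from any of the cited references or from an actual network interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageDescriptor:
    """Hypothetical descriptor layout; all field names are illustrative."""
    local_offset: int   # "pointer" to the local application data to be sent
    length: int         # length of the data to be sent, in bytes
    remote_node: int    # identifies the remote compute node
    remote_offset: int  # remote memory location slated to receive the data


def send(desc, local_memory, remote_memories):
    """Toy model of an interface sending the message a descriptor describes."""
    payload = local_memory[desc.local_offset:desc.local_offset + desc.length]
    target = remote_memories[desc.remote_node]
    # Copy the payload into the remote node's memory at the described location.
    target[desc.remote_offset:desc.remote_offset + len(payload)] = payload
    return len(payload)


local = bytearray(b"hello, node 3!")
remotes = {3: bytearray(16)}  # memory of a hypothetical remote node 3
desc = MessageDescriptor(local_offset=7, length=6, remote_node=3, remote_offset=4)
sent = send(desc, local, remotes)
```

In this sketch the descriptor alone is sufficient for the interface to perform the transfer, which corresponds to the statement above that, given a message descriptor, the interface sends the corresponding message.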
The conventional arts, including the above-cited prior art references, are constructed to operate by handling each message descriptor individually, handing each off as a separate operation into one of perhaps several Injection FIFOs. The number of Injection FIFOs and their properties are fixed in the prior art. In such known conventional network interfaces, the Injection FIFOs are known to comprise some part of the conventional network interface. Thus, a local instance of the global application sending the data must first individually insert each message descriptor into an Injection FIFO at its conventional network interface. The effort required by the sending node using the local instance of the application to carry out such known messaging and protocol is proportional to the number of messages to be exchanged. Accordingly, the performance of both the global application and the local instances of the application running at a compute node decreases with the number of messages to be sent/received.
This invention, broadly, concerns the use and re-use of multiple message descriptors. A message descriptor describes a message to a single node. Whether an individual message descriptor is simple or sophisticated is not of concern here. For example, the Welsh reference provides for message-passing operation that utilizes complex message descriptors using a technique defined as static chaining, which includes the combining of linked list representations of multiple frames of data for the same connection (compute node) by the host before the host queues the chain in the transmit input queue. The Welsh reference articulates that a benefit of such a technique is that only one frame descriptor of the chain, or linked list representation, is required to describe multiple frames for the same connection (compute node). Such a chain or frame descriptor in the Welsh reference corresponds to a single message descriptor.
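The proportionality between processor effort and message count may be illustrated with a toy model. The class `ConventionalInterface`, its methods, and the counter `cpu_insertions` are hypothetical names introduced for this sketch only; the descriptor labels 11-14 and 21-24 are borrowed from FIG. 2 merely as placeholders.

```python
from collections import deque


class ConventionalInterface:
    """Toy model: a fixed number of Injection FIFOs of fixed capacity."""

    def __init__(self, num_fifos=2, capacity=4):
        self.fifos = [deque() for _ in range(num_fifos)]
        self.capacity = capacity
        self.cpu_insertions = 0  # work done by the processor core
        self.sent = []           # descriptors processed by the interface

    def inject(self, fifo_id, descriptor):
        """The processor inserts one descriptor; one operation per message."""
        fifo = self.fifos[fifo_id]
        if len(fifo) >= self.capacity:
            raise BufferError("Injection FIFO full")
        fifo.append(descriptor)
        self.cpu_insertions += 1

    def drain(self):
        """The interface sends every queued message, FIFO by FIFO."""
        for fifo in self.fifos:
            while fifo:
                self.sent.append(fifo.popleft())


nic = ConventionalInterface()
descriptors = {0: [11, 12, 13, 14], 1: [21, 22, 23, 24]}
iterations = 3
for _ in range(iterations):
    # Every communication phase re-inserts every descriptor individually.
    for fifo_id, descs in descriptors.items():
        for d in descs:
            nic.inject(fifo_id, d)
    nic.drain()
```

In this model, three communication phases of eight messages each cost the processor twenty-four insertions, so the processor effort grows linearly with the number of messages.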
As described above, the conventional arts including the above-cited prior art references are constructed to operate by handling, and handing off each individual message descriptor individually, or as separate operations. The prior art techniques may be extended to better serve an iterative application, which repeatedly performs a computation phase followed by a communication phase. Such iterations may comprise a part of many global applications for simulating a physical system, a financial system or other complex systems, where the communication pattern is the same and only the application data transferred differs between iterations. The same message descriptors are used in the known prior art applications to communicate the changing application data. That is, the same message descriptors in the Injection FIFO are created in the first iteration, and re-used for subsequent iterations.
In the prior art, re-using the message descriptors requires a local processor core to copy each message descriptor into the local (conventional) network interface. In other words, the prior art requires a processor core to copy the contents of the message buffer into the local (conventional) network interface. In this invention, the global message-passing application initiates the entire communication phase by providing the conventional network interface with a brief description of the Injection FIFO. This invention thus frees the processor core from copying each message descriptor into the local (conventional) network interface. The processor core thus is made available for other work for the application.
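The contrast drawn above may be sketched with a toy model in which the message descriptors are written once and each later communication phase is started by a single brief description of the Injection FIFO. This is not the DMA engine of the invention; the class `ReusableFifoInterface`, the `start` method, and the `(fifo, head, count)` form of the FIFO description are assumptions chosen only to make the bookkeeping visible.

```python
class ReusableFifoInterface:
    """Toy model: the interface walks a FIFO that is already populated."""

    def __init__(self):
        self.cpu_operations = 0  # work done by the processor core
        self.sent = []           # descriptors processed by the interface

    def start(self, fifo, head, count):
        """The processor supplies only a brief FIFO description."""
        self.cpu_operations += 1
        for i in range(count):
            # Walking the descriptors is the interface's work, not the core's.
            self.sent.append(fifo[head + i])


# The descriptors are created once, in the first iteration (labels hypothetical).
fifo = [11, 12, 13, 14, 21, 22, 23, 24]
setup_operations = len(fifo)

nic = ReusableFifoInterface()
iterations = 3
for _ in range(iterations):
    # Only the application data changes between iterations; the same
    # descriptors are re-used without being copied again.
    nic.start(fifo, head=0, count=len(fifo))
```

Under these assumptions the per-iteration processor cost is one operation regardless of how many descriptors the FIFO holds, whereas the one-time setup cost remains proportional to the number of descriptors.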
MPI is the industry-standard message-passing interface, and is used in parallel computer systems. An MPI program consists of autonomous processes, executing their own code, which need not be identical. Typically, each process or application communicates via calls to MPI communication primitives, where each process executes in its own and shared memory. Such message passing allows the local processors comprising the compute node, and applications running thereon (a thread or instance of the global application or process) to cooperate with each other. Generally speaking, MPI is an interface designed to allow a user to code data such that the local processors at the compute nodes comprising the network are able to send and receive data and information to coordinate the disposition of the global application or process. MPI is available on a wide variety of platforms ranging from massively parallel systems (IBM, Cray, Intel Paragon, etc.) to networks of workstations.
The use of Direct Memory Access (DMA) technology provides for reducing CPU (processor) workload in the management of memory operations required for messaging in any computer system, and is particularly relied on in large parallel computer systems. DMA, or DMA engines, work in conjunction with the local application or thread implementing the MPI application, for example, within a conventional network interface such as that of prior art FIG. 2. Workload that would normally have to be processed by a CPU at a compute node is instead handled by the DMA engine. The use of DMA technology in large parallel computer systems is limited somewhat by such systems' inherent need for tight hardware control and coordination of memory and message-passing operations. That is, the required tight coupling between memory operations and CPU operations poses some challenges, including the need for a sending side compute node (a source compute node originating a message to be passed to another or multiple other compute nodes) to have awareness of the receiver node's remote address spaces, multiple protection domains, locked down memory requirements (also called pinning), notification, striping, recovery models, etc.
In a parallel computer, such as IBM Corporation's BlueGene/P, a “Rendezvous” protocol is often used to send long messages between compute nodes. Following the Rendezvous protocol, a source compute node (a thread or instance of the global application running on the sending side) sends a long message by first passing a “request to send (RTS)” packet to the target compute node. The RTS contains information or data identifying the source compute node and the message being sent, e.g., number of total bytes. The target compute node replies to the RTS by generating and sending a “clear to send (CTS)” packet, assuming the target compute node is able to receive. The CTS includes information or data describing the target compute node (destination side) that is needed in order to receive the entire message. Finally, the source compute node sends self-describing “data” packets to the target compute node, which can arrive in any order. The packet transfer continues until the entire message has been sent. RTS packet transfers, or message transfers following the Rendezvous protocol, are ordered with respect to the transfer of other messages out of a compute node, or into a compute node, e.g., with respect to other rendezvous or eager messages.
What would be desirable in the field of parallel computer systems and their design are network interfaces or DMA engines constructed to perform remote message passing efficiently for repeated communication patterns in an application. For such a repeated communication pattern, the amount of effort or overhead required by an application running on a compute node to carry out the messaging would desirably be fixed, regardless of the number of messages, so that the effort and attention required of the local compute node in handling the messaging via the Injection FIFOs is contained and local performance is maintained without degradation even as the number of messages increases.
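The three phases of the Rendezvous protocol described above may be sketched as follows. This is a single-process simulation under stated assumptions, not BlueGene/P code: the packet dictionaries, their field names, the node numbers, and the use of a shuffle to model out-of-order arrival are all hypothetical.

```python
import random


def rendezvous_send(message, packet_size, seed=0):
    """Simulate RTS, CTS, then self-describing data packets in any order."""
    log = []

    # 1. The source announces the message with a request-to-send packet.
    rts = {"type": "RTS", "source": 0, "total_bytes": len(message)}
    log.append(rts)

    # 2. The target, being able to receive, prepares a buffer and replies.
    buffer = bytearray(rts["total_bytes"])
    cts = {"type": "CTS", "target": 1}
    log.append(cts)

    # 3. The source sends data packets; each carries its own offset, so the
    #    target can place it correctly whatever the arrival order.
    packets = [
        {"type": "DATA", "offset": off, "payload": message[off:off + packet_size]}
        for off in range(0, len(message), packet_size)
    ]
    random.Random(seed).shuffle(packets)  # model arbitrary arrival order

    received = 0
    for pkt in packets:
        end = pkt["offset"] + len(pkt["payload"])
        buffer[pkt["offset"]:end] = pkt["payload"]
        received += len(pkt["payload"])
        log.append(pkt)

    # The transfer is complete once every byte announced in the RTS arrived.
    if received != rts["total_bytes"]:
        raise RuntimeError("incomplete transfer")
    return bytes(buffer), log


message = bytes(range(100))
result, log = rendezvous_send(message, packet_size=32)
```

Because each data packet in the sketch carries its own offset, the target reassembles the message correctly without relying on delivery order, which is the sense in which the data packets are self-describing.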