1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for completion processing for data communications instructions in a distributed computing environment.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Data communications is an area of computer technology that has experienced advances, and modes of data communications today effectively implement distributed computing environments. In the 1990s, a consortium that included Apollo Computer (later part of Hewlett-Packard), IBM, Digital Equipment Corporation, and others developed a software system that was named ‘Distributed Computing Environment.’ That software system is mentioned here for the sake of clarity to explain that the term ‘distributed computing environment’ as used in this specification does not refer that software product from the 1990s. As the term is used here, ‘distributed computing environment’ refers to any aggregation of computers or compute nodes coupled for data communications through a system-level messaging layer in their communications protocol stacks, where the system-level messaging layer provides ‘active’ messaging, messaging with callback functions. Implementations of such system-level messaging include messaging layers in client-server architectures, messaging layers in Symmetric Multi-Processing (‘SMP’) architectures with Non-Uniform Memory Access (‘NUMA’), and messaging layers in parallel computers, including Beowulf clusters and even supercomputers with many compute node coupled for data communications through such system-level messaging. Common implementations of system-level messaging for parallel processing include the well known Message Passing Interface (‘MPI’) and the Parallel Virtual Machine (‘PVM’). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI implementations include OpenMPI and MPICH. These and others represent examples of implementations of system-level messaging that can be improved for completion processing for data communications instructions in a distributed computing environment according to embodiments of the present invention.
Parallel computing is another area of computer technology that has experienced advances. Parallel computing is the simultaneous execution of the same application (split up and specially adapted) on multiple processors in order to obtain results faster. Parallel computing is based on the fact that the process of solving a problem often can be divided into smaller jobs, which may be carried out simultaneously with some coordination. Parallel computing expands the demands on middleware messaging beyond that of other architectures because parallel computing includes collective operations, operations that are defined only across multiple compute nodes in a parallel computer, operations that require, particularly in supercomputers, massive messaging at very high speeds. Examples of such collective operations include BROADCAST, SCATTER, GATHER, AND REDUCE operations.
Many data communications network architectures are used for message passing among nodes in parallel computers. Compute nodes may be organized in a network as a ‘torus’ or ‘mesh,’ for example. Also, compute nodes may be organized in a network as a tree. A torus network connects the nodes in a three-dimensional mesh with wrap around links. Every node is connected to its six neighbors through this torus network, and each node is addressed by its x,y,z coordinate in the mesh. In a tree network, the nodes typically are connected into a binary tree: each node has a parent and two children (although some nodes may only have zero children or one child, depending on the hardware configuration). In computers that use a torus and a tree network, the two networks typically are implemented independently of one another, with separate routing circuits, separate physical links, and separate message buffers.
A torus network lends itself to point to point operations, but a tree network typically is inefficient in point to point communication. A tree network, however, does provide high bandwidth and low latency for certain collective operations, message passing operations where all compute nodes participate simultaneously, such as, for example, an allgather.
There is at this time a general trend in computer processor development to move from multi-core to many-core processors: from dual-, tri-, quad-, hexa-, octo-core chips to ones with tens or even hundreds of cores. In addition, multi-core chips mixed with simultaneous multithreading, memory-on-chip, and special-purpose heterogeneous cores promise further performance and efficiency gains, especially in processing multimedia, recognition and networking applications. This trend is impacting the supercomputing world as well, where large transistor count chips are more efficiently used by replicating cores, rather than building chips that are very fast but very inefficient in terms of power utilization.
At the same time, the network link speed and number of links into and out of a compute node are dramatically increasing. IBM's BlueGene/Q™ supercomputer, for example, will have a five-dimensional torus network, which implements ten bidirectional data communications links per compute node—and BlueGene/Q will support many thousands of compute nodes. To keep these links filled with data, DMA engines are employed, but increasingly, the HPC community is interested in latency. In traditional supercomputers with pared-down operating systems, there is little or no multi-tasking within compute nodes. When a data communications link is unavailable, a task typically blocks or ‘spins’ on a data transmission, in effect, idling a processor until a data transmission resource becomes available. In the trend for more powerful individual processors, such blocking or spinning has a bad effect on latency.
Of course if an application blocks or ‘spins’ on a data communications program, then the application is advised immediately when the transfer of data pursuant to the instruction is completed, because the application cease further processing until the instruction is completed. But that benefit comes at the cost of the block or the spin during a period of time when a high performance application really wants to be doing other things, not waiting on input/output. There is therefore a trend in the technology of large scale messaging toward attenuating this need to spin on a data communications resource waiting for completion of a data transfer. There is a trend toward supporting non-blocking data communications instructions that allow an application to fire-and-forget an instruction and check later with some infrastructure to confirm that the corresponding data transfer has actually been completed. The trend is to track data transfers with message sequence numbers stored temporarily in communications buffers in messaging infrastructure. If a message can be immediately completed, its sequence number can be flagged as completed, and the application can call down into the messaging infrastructure to figure out whether the message data has been sent. For messages that take more time, a completion descriptor can be created and marked later to advise the application when a transfer is completed. All these prior art methods of completion processing for data communications instructions, however, require significant data processing overheads, maintenance of additional data structures and data, additional system calls from the application to check on instruction completion.