1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for broadcasting a message in a parallel computer.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Parallel computing is an area of computer technology that has experienced advances. Parallel computing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain results faster. Parallel computing is based on the fact that the process of solving a problem usually can be divided into smaller tasks, which may be carried out simultaneously with some coordination.
Parallel computers execute parallel algorithms. A parallel algorithm can be split up to be executed a piece at a time on many different processing devices, and then put back together again at the end to get a data processing result. Some algorithms are easy to divide up into pieces. Splitting up the job of checking all of the numbers from one to a hundred thousand to see which are primes could be done, for example, by assigning a subset of the numbers to each available processor, and then putting the list of positive results back together. In this specification, the multiple processing devices that execute the individual pieces of a parallel program are referred to as ‘compute nodes.’ A parallel computer is composed of compute nodes and other processing nodes as well, including, for example, input/output (‘I/O’) nodes, and service nodes.
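The prime-checking example above can be sketched as follows. This is a minimal illustration, not a description of any particular parallel computer: the function names (`is_prime`, `primes_in`, `parallel_primes`) are hypothetical, and a thread pool merely stands in for the compute nodes, so the sketch shows the decomposition and recombination of the work rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division; adequate for small illustrative ranges."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def primes_in(subset):
    """Work done by one hypothetical compute node: check its own subset."""
    return [n for n in subset if is_prime(n)]


def parallel_primes(limit, workers):
    """Split 1..limit into contiguous subsets, check each, and merge the results."""
    chunk = -(-limit // workers)  # ceiling division
    subsets = [range(lo, min(lo + chunk, limit + 1))
               for lo in range(1, limit + 1, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the partial lists in the same order as the subsets.
        partial_lists = list(pool.map(primes_in, subsets))
    # Put the lists of positive results back together.
    return [p for part in partial_lists for p in part]
```

For example, `parallel_primes(100000, 8)` assigns 12,500 numbers to each of eight workers and returns the 9,592 primes not exceeding one hundred thousand.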
Parallel algorithms are valuable because it is faster to perform some kinds of large computing tasks via a parallel algorithm than it is via a serial (non-parallel) algorithm, because of the way modern processors work. It is far more difficult to construct a computer with a single fast processor than one with many slow processors with the same throughput. There are also certain theoretical limits to the potential speed of serial processors. On the other hand, every parallel algorithm has a serial part and so parallel algorithms have a saturation point. After that point adding more processors does not yield any more throughput but only increases the overhead and cost.
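The saturation effect described above is commonly modeled with Amdahl's law, which this specification does not itself name. The sketch below assumes that a fixed fraction of the work is strictly serial and that the remainder divides evenly among the processors; the function name `amdahl_speedup` is hypothetical. The model ignores the added overhead and cost mentioned above, so it is optimistic.

```python
def amdahl_speedup(serial_fraction, processors):
    """Ideal speedup when `serial_fraction` of the work cannot be parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # With 5% serial work the speedup approaches, but never reaches, 1 / 0.05 = 20.
    for n in (1, 10, 100, 1000, 10000):
        print(n, round(amdahl_speedup(0.05, n), 2))
```

Under these assumptions, going from 1,000 to 10,000 processors raises the speedup by less than one, which is the saturation point in practical terms.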
Parallel algorithms are also designed to optimize one more resource: the data communications requirements among the nodes of a parallel computer. There are two ways parallel processors communicate: shared memory or message passing. Shared memory processing needs additional locking for the data, imposes the overhead of additional processor and bus cycles, and also serializes some portion of the algorithm.
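The locking and serialization cost of shared memory processing can be illustrated with a minimal sketch. Threads within one process stand in for processors sharing memory, and the name `shared_memory_sum` is hypothetical. Each worker computes a private partial sum in parallel, but the update of the shared total must be performed under a lock, one worker at a time.

```python
import threading


def shared_memory_sum(values, workers):
    """Sum `values` using workers that update one shared total under a lock."""
    total = 0                     # shared state visible to every worker
    lock = threading.Lock()

    def work(part):
        nonlocal total
        local = sum(part)         # parallel portion: no shared data is touched
        with lock:                # serialized portion: one worker at a time
            total += local

    parts = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Keeping the locked region as small as possible, as here, limits the portion of the algorithm that is serialized.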
Message passing processing uses high-speed data communications networks and message buffers, but this communication adds transfer overhead on the data communications networks, requires additional memory for message buffers, and introduces latency in the data communications among nodes. Parallel computers use specially designed data communications links to keep the communication overhead small, but it is the parallel algorithm that determines the volume of the traffic.
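Message passing can be sketched in the same spirit. In this minimal illustration, an in-process queue stands in for the data communications network and its message buffers, and the name `message_passing_sum` is hypothetical. No worker writes to a shared variable; each sends exactly one message, and the number of messages is fixed by the algorithm rather than by the link.

```python
import queue
import threading


def message_passing_sum(values, workers):
    """Sum `values` using workers that each send one message to a collector."""
    inbox = queue.Queue()         # message buffer owned by the collecting node

    def work(part):
        inbox.put(sum(part))      # send one message; no shared variable is written

    parts = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    # Receive exactly one message per worker and combine them.
    total = sum(inbox.get() for _ in range(workers))
    for t in threads:
        t.join()
    return total
```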
Many data communications network topologies are used for message passing among nodes in parallel computers. Such network topologies may include, for example, a tree, a rectangular mesh, and a torus. In a tree network, the nodes typically are connected into a binary tree: each node typically has a parent and two children (although some nodes may have no children or only one child, depending on the hardware configuration). A tree network typically supports communications where data from one compute node migrates through tiers of the tree network to a root compute node or where data is multicast from the root to all of the other compute nodes in the tree network. In such a manner, the tree network lends itself to collective operations such as, for example, reduction operations or broadcast operations. The tree network, however, does not lend itself to, and is typically inefficient for, point-to-point operations.
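The two traffic patterns that suit a tree network can be sketched as follows. The sketch assumes a binary tree stored in array order, with node 0 as the root and the children of node i at positions 2i+1 and 2i+2. That numbering, and the names `tree_reduce` and `tree_broadcast`, are illustrative assumptions rather than a description of any actual hardware.

```python
def parent(i):
    """Parent of node i, or None for the root (node 0)."""
    return (i - 1) // 2 if i > 0 else None


def children(i, n):
    """Children of node i in an n-node tree (zero, one, or two of them)."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]


def tree_reduce(values):
    """Sum contributions upward, tier by tier, until they reach the root."""
    partial = list(values)
    # Visiting higher-numbered (deeper) nodes first guarantees that a node
    # has absorbed all of its descendants before it reports to its parent.
    for i in range(len(partial) - 1, 0, -1):
        partial[parent(i)] += partial[i]
    return partial[0]


def tree_broadcast(message, n):
    """Multicast from the root; return each node's buffer and its arrival hop."""
    received = [None] * n
    hop = [None] * n
    received[0], hop[0] = message, 0
    frontier = [0]
    while frontier:
        next_frontier = []
        for node in frontier:
            for c in children(node, n):
                received[c], hop[c] = message, hop[node] + 1
                next_frontier.append(c)
        frontier = next_frontier
    return received, hop
```

In a seven-node tree, `tree_reduce([1, 2, 3, 4, 5, 6, 7])` returns 28, and `tree_broadcast("m", 7)` reaches every node within two hops of the root.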
A rectangular mesh topology connects compute nodes in a three-dimensional mesh, and every node is connected with up to six neighbors through this mesh network. Each compute node in the mesh is addressed by its x, y, and z coordinates. A torus network connects the nodes in a manner similar to the three-dimensional mesh topology, but adds wrap-around links in each dimension such that every node is connected to its six neighbors through this torus network. In computers that use a torus and a tree network, the two networks typically are implemented independently of one another, with separate routing circuits, separate physical links, and separate message buffers. Other network topologies often used to connect nodes of a network include a star, a ring, or a hypercube. While the tree network generally lends itself to collective operations, a mesh or a torus network generally lends itself well to point-to-point communications. Although in general each type of network is optimized for certain communications patterns, those communications patterns may generally be supported by any type of network.
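The difference between the mesh and the torus can be made concrete by listing the neighbors of a node. The following sketch is illustrative only. The function names are hypothetical, `dims` gives the number of nodes along x, y, and z, and each dimension is assumed to hold at least three nodes so that the six torus neighbors are distinct.

```python
def mesh_neighbors(node, dims):
    """Neighbors of `node` in a 3-D mesh: up to six, fewer on a face or corner."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] += step
            if 0 <= coord[axis] < dims[axis]:   # no link past the mesh boundary
                result.append(tuple(coord))
    return result


def torus_neighbors(node, dims):
    """Neighbors of `node` in a 3-D torus: always six, using wrap-around links."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]   # wrap around
            result.append(tuple(coord))
    return result
```

In a 4 x 4 x 4 network, the corner node (0, 0, 0) has three mesh neighbors but six torus neighbors, the additional three being (3, 0, 0), (0, 3, 0), and (0, 0, 3).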
As mentioned above, the tree network is optimized for collective operations. Some collective operations have a single originating or receiving process running on a particular compute node in an operational group. For example, in a ‘broadcast’ collective operation, the process on the compute node that distributes the data to all the other compute nodes is an originating process. In a ‘gather’ operation, for example, the process on the compute node that receives all the data from the other compute nodes is a receiving process. The compute node on which such an originating or receiving process runs is referred to as a logical root.
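The role of the logical root in these two operations can be shown with a minimal sketch that models only the end result of each operation, not the network traffic that produces it. Nodes are identified by their position in a list, and the names `broadcast` and `gather` are hypothetical helpers rather than calls into any message passing library.

```python
def broadcast(buffers, root):
    """Originating process at `root`: every node ends up with the root's data."""
    data = buffers[root]
    return [data for _ in buffers]


def gather(contributions, root):
    """Receiving process at `root`: only the root ends up with all contributions."""
    result = [None] * len(contributions)
    result[root] = list(contributions)   # collected in node order
    return result
```

For three nodes, `broadcast(["a", "b", "c"], 1)` yields `["b", "b", "b"]`, while `gather([10, 20, 30], 2)` leaves `[10, 20, 30]` at node 2 and nothing at the other nodes.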
The collective tree network supports efficient collective operations because of the low latency associated with propagating a logical root's message to all of the other nodes in the collective tree network. The low latency for such data transfers results from the collective tree network's ability to multicast data from the physical root of the tree to the leaf nodes of the tree. The physical root of the collective tree network is the node at the top of the physical tree topology and is physically configured to have only child nodes and no parent node. In contrast, the leaf nodes are nodes at the bottom of the tree topology and are physically wired to have only a parent node and no child nodes. Currently, when the logical root is ready to broadcast a message to the other nodes in the operational group, the logical root must first send the entire message to the physical root of the tree network, which, in turn, multicasts the entire message down the tree network to all the nodes in the operational group. The drawback to this current mechanism is that the initial step of sending the entire message from the logical root to the physical root before any of the other nodes receive the message may delay the propagation of the message to all of the nodes in the operational group.
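The delay inherent in the current mechanism can be illustrated with a simplified timing model. The sketch assumes a binary tree stored in array order with the physical root at node 0, and it assumes that the entire message crosses one link per step. Both assumptions, and the name `broadcast_via_physical_root`, are illustrative and do not describe any particular hardware.

```python
def depth(i):
    """Number of links between node i and the physical root (node 0)."""
    d = 0
    while i > 0:
        i = (i - 1) // 2
        d += 1
    return d


def broadcast_via_physical_root(n, logical_root):
    """Return, for each of n nodes, the step at which it holds the message.

    Phase 1: the logical root sends the entire message up to the physical root.
    Phase 2: the physical root multicasts the message down the tree.
    """
    uplink = depth(logical_root)          # steps spent before any multicast begins
    arrival = [uplink + depth(i) for i in range(n)]
    arrival[logical_root] = 0             # the logical root already holds the message
    return arrival
```

With seven nodes and the logical root at node 5, two links below the physical root, the arrival steps are `[2, 3, 3, 4, 4, 0, 4]`. With the logical root at node 0 they are `[0, 1, 1, 2, 2, 2, 2]`. Under this model every other node waits an additional two steps, which is the cost of the initial transfer to the physical root.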