Blue Gene is a series of supercomputers that can reach operating speeds in the PFLOPS (PetaFLOPS) range with low power consumption. To date, there have been three Blue Gene generations: Blue Gene/L, Blue Gene/P, and Blue Gene/Q. The Blue Gene (BG) systems have several times led rankings of the most powerful and most power-efficient supercomputers, and the project was awarded the 2009 U.S. National Medal of Technology and Innovation.
FIG. 1 shows an exemplary configuration of the hierarchy 100 of BG processing units, beginning with individual compute chips 102, each having two processors and 4 MB of memory, and progressing to an exemplary system 104 of 64 cabinets encompassing 65,536 nodes with 131,072 CPUs. The BG/Q can potentially scale to 100 PFLOPS, and the Sequoia machine at Lawrence Livermore is a BG/Q with a 20 PFLOPS capability and over 1.6 million cores. However, the exact number of units or the exact architecture is merely representative and is not particularly significant to understanding the present invention, and the present invention can be implemented on configurations other than Blue Gene configurations.
The Blue Gene/P and Blue Gene/Q machines have a torus interconnect for application message passing, with the Blue Gene/P (BG/P) using a three-dimensional torus network, while the Blue Gene/Q (BG/Q) uses a five-dimensional torus network. Thus, in a BG/P machine, a core node location could be identified in a three-axis coordinate notation such as <A,B,C>; in a BG/Q machine, a core node location could be identified in a five-axis coordinate notation such as <A,B,C,D,E>. It should be noted that such coordinate axes notation can be considered an ordered arrangement, so that one of skill in the art would be able to understand how the description “next higher dimension” or “adjacent dimension” or “dimension with a predetermined association with a dimension used to transmit a packet” or “predetermined dimension” or “dimension different from the dimension upon which the instruction was received” would have meaning.
Moreover, in multiprocessors with a torus configuration, there are interconnections between dimensions, and there is also a wraparound effect in each dimension. Again, exact details of the architecture or manufacturer of the computer system should not be considered as limiting, since the present invention can be implemented on any multidimensional multi-processor system, meaning that the processor cores are arranged in an interconnected multidimensional configuration, including configurations interconnected as a torus, as long as the multidimensional system uses an inter-nodal interface device that can operate autonomously of its associated processor.
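By way of illustration only, the wraparound effect in each dimension can be sketched with modular arithmetic on the coordinate notation described above. The function name torus_neighbors and the partition sizes used here are hypothetical and are not taken from any Blue Gene software interface; the sketch merely assumes that each node has one neighbor in the minus direction and one in the plus direction of every dimension.

```python
def torus_neighbors(coord, dims):
    """Return the nearest neighbors of `coord` on an N-dimensional torus.

    `coord` is a tuple such as (A, B, C) or (A, B, C, D, E), and `dims`
    gives the (assumed) size of the partition along each axis.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, +1):
            n = list(coord)
            # The modulo implements the wraparound link that closes each
            # dimension into a ring.
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


# Three-axis example: node <0,0,0> in a hypothetical 4 x 4 x 4 partition.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

In this sketch, the node at <0,0,0> is adjacent to <3,0,0> through the wraparound link, and a five-axis coordinate yields ten neighbor entries rather than six. When a dimension has size 2, the plus and minus neighbors along that axis coincide, so the same coordinate appears twice in the returned list.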
The DMA (Direct Memory Access) unit on the BG/P and the MU (Messaging Unit) device on the BG/Q offload communications from the processor cores as a mechanism to intercommunicate between the cores. These devices support three modes of communication: 1) memory FIFO (First In First Out), 2) direct put, and 3) remote get.
Memory FIFO messages, such as Remote Direct Memory Access (RDMA), move data packets to a remote memory buffer, and the direct put instruction moves data payload directly to a remote memory buffer. A remote get can move remote data payload to a local buffer. In a remote get operation, the payload of the remote get contains a direct put descriptor which can initiate a put back to the node that initiates the remote get or to any other node.
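The remote get operation described above can be modeled, purely as a sketch, by treating the payload of the remote get as a descriptor that the target node's messaging hardware executes as a direct put. All names below (PutDescriptor, Node, direct_put, remote_get) are hypothetical and chosen for illustration; they do not correspond to the actual DMA or MU programming interface, and node memory is modeled simply as a dictionary of named buffers.

```python
from dataclasses import dataclass, field


@dataclass
class PutDescriptor:
    """Hypothetical direct put descriptor carried as a remote get payload."""
    src_buffer: str   # buffer to read on the node that executes the put
    dest_node: int    # node that receives the data
    dest_buffer: str  # buffer to write on the destination node


@dataclass
class Node:
    rank: int
    memory: dict = field(default_factory=dict)


def direct_put(nodes, src_rank, desc):
    """Copy a payload directly into a buffer on the destination node."""
    payload = nodes[src_rank].memory[desc.src_buffer]
    nodes[desc.dest_node].memory[desc.dest_buffer] = payload


def remote_get(nodes, target_rank, desc):
    """Deliver `desc` to the target node, which then performs the put.

    The target node's processor is not modeled as taking part: the
    descriptor alone determines what is moved and where it lands.
    """
    direct_put(nodes, target_rank, desc)


nodes = [Node(rank) for rank in range(3)]
nodes[1].memory["src"] = b"payload"

# Put back to the initiating node 0.
remote_get(nodes, 1, PutDescriptor("src", 0, "dst"))
# Put to a third node instead of the initiator.
remote_get(nodes, 1, PutDescriptor("src", 2, "dst"))
```

The second call illustrates the point made above that the put need not return to the initiating node: the descriptor names its own destination, so the data can land on any other node.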
In addition, the torus network instruction set allows packets to be deposited along a line of the torus. Broadcast on a 3D rectangle partition can be done by, for example, a deposit bit send along the X direction, followed by a deposit bit send along the Y direction, and finally a deposit bit send along the Z direction. In the conventional broadcast mechanisms, each of the intermediate steps requires processor interaction to trace incoming data and then initiate the next dimension, which processing the present inventors have recognized as adversely affecting the latency of the broadcast operation.
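The three-phase rectangle broadcast described above can be sketched as a small simulation. This is an illustrative model only, under the assumption that every node already holding the data initiates one deposit bit line send per phase; the function names deposit_line and rectangle_broadcast and the 4 x 3 x 2 partition are hypothetical.

```python
def deposit_line(holders, axis, dims):
    """Each current holder sends along `axis`; every node on that line
    deposits a copy of the packet as it passes."""
    reached = set(holders)
    for node in holders:
        for i in range(dims[axis]):
            n = list(node)
            n[axis] = i
            reached.add(tuple(n))
    return reached


def rectangle_broadcast(root, dims):
    """Broadcast from `root` one dimension at a time (X, then Y, then Z).

    Returns the final set of holders and, per phase, a tuple of
    (axis, number of nodes that had to initiate a send, holders afterwards).
    """
    holders = {root}
    phases = []
    for axis in range(len(dims)):
        senders = len(holders)
        holders = deposit_line(holders, axis, dims)
        phases.append((axis, senders, len(holders)))
    return holders, phases


holders, phases = rectangle_broadcast(root=(1, 2, 0), dims=(4, 3, 2))
for axis, senders, total in phases:
    print(f"phase on axis {axis}: {senders} initiating nodes, {total} holders")
```

In this model only the root initiates the first phase, but the second and third phases are initiated by 4 and 12 nodes respectively. In the conventional mechanism, each of those initiations requires the receiving processor to detect the incoming data and start the next dimension, which is the source of the latency noted above.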
These operations are representative of a larger number of instructions described in the Message Passing Interface (MPI), a standardized, portable message-passing system designed by a group of researchers for parallel computers in the early 1990s. This standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in different computer programming languages such as Fortran, C, C++ and Java. Although MPI is not sanctioned by any major standards group, it has become a de facto standard that has fostered the development of a more standardized parallel software industry, by encouraging the development of portable and scalable large-scale parallel applications. MPI is a language-independent communications protocol used for programming parallel computers and supports both point-to-point and collective communication.
Overall, MPI provides a rich range of abilities, including communicator objects that selectively connect groups of processes in the MPI session. A number of other important MPI functions, referred to as point-to-point operations, involve communication between two specific processes. Collective functions involve communication among all processes in a process group, which can mean the entire process pool or a program-defined subset. For example, a broadcast function can take data from one node and send it to all processes in the process group. A reverse operation can take data from all processes in a group, perform an operation (such as summing), and store the results on one node. Such collective functions can be useful at the start or end of a large distributed calculation, where each processor operates on a part of the data and then combines it into a result.
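The semantics of the broadcast and of the reverse (reduction) operation can be illustrated with a minimal model in which a process group is represented as a list indexed by rank. This is a sketch of the semantics only and is not an MPI implementation; in MPI itself these collectives are provided by MPI_Bcast and MPI_Reduce, whereas the helper names broadcast and reduce_to_root below are hypothetical.

```python
from functools import reduce as fold
import operator


def broadcast(buffers, root):
    """After a broadcast, every rank holds the value that `root` held."""
    return [buffers[root] for _ in buffers]


def reduce_to_root(buffers, op, root):
    """Combine the contribution of every rank with `op` and store the
    result on `root` only; other ranks receive nothing (None here)."""
    result = fold(op, buffers)
    return [result if rank == root else None for rank in range(len(buffers))]


# Start of a distributed calculation: rank 1 distributes a parameter.
params = broadcast([None, 7, None, None], root=1)

# Each rank computes a partial result from its share of the work.
partials = [p * rank for rank, p in enumerate(params)]

# End of the calculation: the partial results are summed onto rank 0.
totals = reduce_to_root(partials, operator.add, root=0)
print(params, partials, totals)
```

The usage lines mirror the pattern described above, in which a collective distributes data at the start of a calculation and a second collective combines the per-processor results at the end.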
Asynchronous one-sided collectives that do not involve participation of the intermediate and destination processors are critical for achieving good performance in programming paradigms such as Charm++, UPC, etc. For example, in an asynchronous one-sided broadcast the root initiates the broadcast, and all destination processors receive the broadcast message without any intermediate nodes forwarding the broadcast message to other nodes. Thus, in the above-mentioned 3D rectangle broadcast there are three phases of processor involvement as each deposit bit line broadcast can only propagate data on one dimension at a time.
The present inventors have recognized that efficiency can be improved for broadcast instruction propagation in a multiprocessor having a plurality of nodes interconnected in a multidimensional (e.g., N dimensions, where N>1) configuration, each node having a processor and at least one associated inter-nodal interface device used to offload data from that node to other nodes in the system, if the processors at the intermediate nodes can be relieved of participation in implementing such broadcast instruction propagation.