1. Field of the Invention
This invention generally relates to parallel computing systems, and more specifically, to embedding global barrier and collective networks in a torus network.
2. Background Art
Massively parallel computing structures (also referred to as “ultra-scale” or “supercomputers”) interconnect large numbers of compute nodes, generally in the form of very regular structures, such as meshes, lattices, or torus configurations. The conventional approach for the most cost-effective ultrascalable computers has been to use processors configured as uni-processors or in symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Today, these supercomputing machines achieve computing performance of over one petaflops.
One family of such massively parallel computers has been developed by the International Business Machines Corporation (IBM) under the name Blue Gene. Two members of this family are the Blue Gene/L system and the Blue Gene/P system. The Blue Gene/L system is a scalable system having over 65,000 compute nodes. Each node consists of a single application specific integrated circuit (ASIC) with two CPUs and memory. The full computer system is housed in sixty-four racks or cabinets, with thirty-two node boards, or about a thousand nodes, in each rack.
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to the compute nodes is handled by the I/O nodes. In the compute node core, the compute nodes are arranged into both a logical tree structure and a multi-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to connect directly with its six nearest neighbors in a section of the computer.
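The six-neighbor connectivity of a three-dimensional torus can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `torus_neighbors`, the coordinate tuples, and the dimension sizes are hypothetical and are not taken from the Blue Gene/L design. Each axis wraps around, so a node on one face of the lattice is adjacent to the node on the opposite face.

```python
def torus_neighbors(coord, dims):
    """Return the nearest neighbors of `coord` in a 3-D torus.

    `coord` is an (x, y, z) tuple and `dims` gives the torus size along
    each axis. Both names are hypothetical, chosen for illustration only.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Modular arithmetic provides the wraparound link of the torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    # A corner node of a hypothetical 8 x 8 x 8 torus still has six neighbors.
    for neighbor in torus_neighbors((0, 0, 0), (8, 8, 8)):
        print(neighbor)
```

The six neighbors are distinct only when every dimension has at least three nodes; with a dimension of size two, the "-1" and "+1" steps reach the same node.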
In massively parallel computing structures, multiple network paradigms are implemented to interconnect nodes for use individually or simultaneously and include three high-speed networks for parallel algorithm message passing. Additional networks are provided for external connectivity used for Input/Output, System Management and Configuration, and Debug and Monitoring services for the supercomputer nodes. The high-speed networks preferably include n-dimensional Torus, Global Tree, and Global Signal configurations. The use of each of these networks may switch back and forth based on algorithmic needs or phases of algorithms. For example, part of a calculation may be performed on the Torus and part on the Global Tree, which facilitates the development of new parallel algorithms that simultaneously employ multiple networks in novel ways.
With respect to the Global Tree network, one primary functionality is to support global broadcast (down-tree) and global reduce (up-tree) operations. Additional functionality is provided to support programmable point-to-point or sub-tree messaging used for input/output, program load, system management, parallel job monitoring and debug. This functionality enables “service” or input/output nodes to be isolated from the Torus so as not to interfere with parallel computation. That is, all nodes in the Torus may operate at the full computational rate, while service nodes off-load asynchronous external interactions. This ensures scalability and repeatability of the parallel computation since all nodes performing the computation operate at the full and consistent rate. Preferably, the global tree supports the execution of those mathematical functions implementing reduction messaging operations. Preferably, the Global Tree network additionally supports multiple independent virtual channels, allowing multiple independent global operations to proceed simultaneously. The design is configurable and the ratio of computation nodes to service nodes is flexible depending on requirements of the parallel calculations. Alternate packaging strategies allow any ratio, including a machine comprised of all service or input/output nodes, as would be ideal for extremely data-intensive computations.
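The up-tree reduce and down-tree broadcast described above can be sketched in software as follows. This is a minimal simulation, not the hardware mechanism: the heap-style node numbering (node i has parent (i - 1) // 2) and the function name `tree_allreduce` are assumptions made for illustration, and the reduction operator is assumed to be associative and commutative.

```python
from operator import add


def tree_allreduce(values, op=add):
    """Simulate a global reduce (up-tree) followed by a broadcast (down-tree).

    `values[i]` is the contribution of node i. The tree layout is a
    hypothetical heap-style binary tree, not the actual machine topology.
    """
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Up-tree: walk from the leaves toward the root, folding each node's
    # partial result into its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = op(partial[parent], partial[i])
    # Down-tree: every node copies the final result from its parent.
    result = [None] * n
    result[0] = partial[0]
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


if __name__ == "__main__":
    print(tree_allreduce([1, 2, 3, 4, 5]))       # every node ends with the sum
    print(tree_allreduce([1, 2, 3, 4, 5], max))  # every node ends with the maximum
```

After the call, every node holds the same reduced value, which is the combined effect of one reduce followed by one broadcast.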
The third network is a Global Signal Network that supports communication of multiple asynchronous ‘signals’ to provide global logical “AND” or “OR” functionality. This functionality is specifically provided to support global barrier operations (“AND”), for indicating to all nodes that, for example, all nodes in the partition have arrived at a specific point in the computation or phase of the parallel algorithm, and global notification (“OR”) functionality, for indicating to all nodes that, for example, one or any node in the partition has arrived at a particular state or condition. Use of this network type enables novel parallel algorithms, coordination, and system management.
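The difference between the barrier (“AND”) and notification (“OR”) functions can be shown with a small sketch. The function names and the list-of-flags representation are hypothetical; the sketch models only the logical result seen by all nodes, not the asynchronous signaling hardware.

```python
def global_and(signals):
    """Barrier condition: true only when every node has raised its signal."""
    return all(signals)


def global_or(signals):
    """Notification condition: true when at least one node has raised its signal."""
    return any(signals)


if __name__ == "__main__":
    # Four hypothetical nodes reach the barrier point one at a time.
    arrived = [False, False, False, False]
    for node in range(len(arrived)):
        arrived[node] = True
        print(node, global_or(arrived), global_and(arrived))
```

In this model the “OR” result becomes true as soon as the first node arrives, while the “AND” result becomes true only after the last node arrives, which is the condition a barrier waits for.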
On the previous-generation Blue Gene/L (BG/L) and Blue Gene/P (BG/P) supercomputers, besides the high-speed 3-dimensional torus network, there are also dedicated collective and global barrier networks. These dedicated networks have the advantage of being independent of one another, but they also have significant drawbacks: (1) they require extra high-speed pins on the chip, resulting in extra packaging cost, and (2) they make it harder to design suitable partitioning in packaging, because the three networks have different topologies.