Field of the Invention
This invention generally relates to parallel computing systems, and more specifically, to routing I/O packets between compute nodes and I/O nodes in a parallel computing system.
Background Art
Massively parallel computing structures (also referred to as “ultra-scale or “supercomputers”) interconnect large numbers of compute nodes, generally, in the form of very regular structures, such as mesh, lattices, or torus configurations. The conventional approach for the most cost/effective ultrascalable computers has been to use processors configured in uni-processors or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Today, these supercomputing machines exhibit computing performance achieving hundreds of teraflops.
One family of such massively parallel computers has been developed by the International Business Machines Corporation (IBM) under the name Blue Gene. Two members of this family are the Blue Gene/L system and the Blue Gene/P system. The Blue Gene/L system is a scalable system having over 65,000 compute nodes. Each node is comprised of a single application specific integrated circuit (ASIC) with two CPUs and memory. The full computer system is housed in sixty-four racks or cabinets with thirty-two node boards in each rack.
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to the compute nodes is handled by the I/O nodes. In the compute node core, the compute nodes are arranged into both a logical tree structure and a multi-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer.
In the Blue Gene/Q system, the compute nodes comprise a multidimensional torus or mesh with N dimensions and that the I/O nodes also comprise a multidimensional torus or mesh with M dimensions. N and M may be different, for scientific computers, typically N>M. Compute nodes do not typically have I/O devices such as disks attached to them, while I/O nodes may be attached directly to disks, or to a storage area network.
Each node in a D dimensional torus has 2D links going out from it. For example, the BlueGene/L computer system (BG/L) and the BlueGene/P computer system (BG/P) have D=3. The I/O nodes in BG/L and BG/P do not communicate with one another over a torus network. Also, in BG/L and BG/P, compute nodes communicate with I/O nodes via a separate collective network. To reduce costs, it is desirable to have a single network that supports point-point, collective, and I/O communications. Also, the compute and I/O nodes may be built using the same type of chips. Thus, for I/O nodes, when M<N, this means simply that some dimensions are not used, or wired, within the I/O torus. To provide connectivity between compute and I/O nodes, each chip has circuitry to support an extra bidirectional I/O link. Generally this I/O link is only used on a subset of the compute nodes. Each I/O node generally has its I/O link attached to a compute node. Optionally, each I/O node may also connect it's unused I/O torus links to a compute node.
In BG/L, point-to-point packets are routed by placing both the destination coordinates and “hint” bits in the packet header. There are two hint bits per dimension indicating whether the packet should be routed in the plus or minus direction; at most one hint bit per dimension may be set. As the packet routes through the network, the hint bit is set to zero as the packet exits a node whose next (neighbor) coordinate in that direction is the destination coordinate. Packets can only move in a direction if its hint bit is set in that direction. Upon reaching its destination, all hint bits are 0. On BG/L, BG/P and BG/Q, there is hardware support, called a hint bit calculator, to compute the best hint bit settings for when packets are injected into the network.