High performance computers (some of which are referred to as supercomputers) are used to perform complex calculations. Such computers are typically used to run applications which perform extremely large numbers of calculations and/or process very large data sets. A high performance computer is required where a common computer cannot complete the required calculations in a practical interval. For example, a 7-day weather forecast is not useful if it takes 8 days of calculation to generate.
There are a wide variety of scientific and commercial applications that can benefit from the ability of high performance computers to rapidly perform vast numbers of calculations. Some examples of such applications include performing simulations of natural phenomena, data mining, engineering analysis of complicated structures, crash simulations, weather forecasting, cryptanalysis, and so on.
In general, high performance computers can be made faster by one or both of two basic design strategies: use faster CPU(s) or use more CPUs. Because of the relatively small market for high performance computers, it is generally not cost effective to design and build unusually fast CPUs for use in high performance computers. Designing and testing a CPU is a very expensive undertaking that can typically be justified only if there are prospects for selling large quantities of the CPU. Further, device physics limitations make dramatic advances in CPU performance difficult to achieve.
Current design strategies for high performance computers involve providing large numbers of CPUs operating in parallel. The CPUs can be commodity off-the-shelf CPUs. FIG. 1 shows a computer system 10 comprising a number of compute sub-systems 11 connected to a shared communication network 14. Each compute sub-system 11 comprises a CPU 12 and memory 13.
There are numerous possible variations on the basic design of system 10. For example, in some cases each compute sub-system 11 comprises more than one CPU 12 that share memory 13. Such computer sub-systems are referred to as Symmetric Multi-Processor (SMP) systems.
The way in which compute sub-systems are packaged for purchase and installation can be important to purchasers of such systems. FIG. 2 illustrates a computer system 10A wherein two or more compute sub-systems 11 are packaged in separate chassis 20. Like other computers, high performance computers typically have mass storage to hold data. There are many possible ways of attaching mass storage, such as disk storage, to a computer system as shown in either of FIGS. 1 and 2.
The topology of shared communication network 14 plays a significant role in the performance of a computer system as shown in FIG. 1 or 2. Network 14 may provide both internal communication (i.e. communication between compute sub-systems 11) and external communication (i.e. communication between a compute sub-system 11 and some computer or device external to computer system 10) or separate networks may be provided for internal and external communications.
External communication is predominantly implemented using a local area network (LAN). If the LAN is connected via a router to the Internet, it is also considered to be a component of a wide area network (WAN). It is common for LANs to be implemented using TCP/IP over Ethernet.
Internal communication might be administrative in nature, or it might represent communication internal to a high performance computing application. Administrative communication has no special requirements, so LAN technology would be an appropriate solution for this type of communication. The internal communication of a high performance computing application requires further consideration.
High performance computing applications are designed to split up a large complex problem into a multiplicity of smaller sub-problems that can be allocated to the available compute sub-systems 11. Some of these applications are considered to be “embarrassingly parallel”, because each sub-problem can be solved independently. Little or no communication is required between the instances of application software solving the various sub-problems. Image rendering of computer-generated movie frames is one example of an “embarrassingly parallel” application, because each frame can be rendered independently. Such applications place no special requirements on shared communication network 14. Cluster computers (a sub-category of high performance computers) with TCP/IP over Ethernet networks are often used to solve “embarrassingly parallel” problems.
In the case of other high performance computing applications, it is not possible to solve the sub-problems independently. Moderate to extensive communication is required to solve the sub-problems. For example, a stellar motion simulation of the $10^{8}$ stars inside the Milky Way galaxy over the course of 5 billion years might be done by allocating the sub-problem of calculating the motions of $10^{4}$ stars to each of the 10,000 CPUs in a high performance computer. Since the motion of each star is determined by the total gravitational attraction of all the other stars in the galaxy, it is necessary for the CPUs to exchange stellar location data periodically so that gravitational forces can be calculated. This could require on the order of $5 \times 10^{14}$ to $5 \times 10^{8}$ messages to be communicated over a single execution of the application. Such an application requires shared communication network 14 to handle large volumes of messages within a reasonable time interval. This requires network 14 to have a large bandwidth.
The topology of the communication network 14 in a high performance computer may provide multiple paths between any given source and destination. Communication network 14 may be configurable to support load balancing message traffic across the multiple paths.
Apart from measuring the bandwidth of individual links, the bandwidth of network 14 may be characterized by a metric called the minimum bi-section bandwidth. A bi-section of a network is a division of the network into two halves, each containing half of the compute sub-systems 11. In a worst case scenario where all of the senders are in one half of the network and the receivers are in the other half of the network, all communication crosses the bi-section. The bi-section bandwidth is the aggregate bandwidth of all the data links that cross the bi-section. If we allow the bi-section to be arbitrarily placed, the minimum bi-section bandwidth arises for the bi-section with the smallest bi-section bandwidth.
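By way of illustration only, the definition above can be checked on a very small network by examining every possible bi-section. The following Python sketch is not part of the described systems; the function names `min_bisection_links` and `ring_links` and the example networks are assumptions introduced here. It counts data links rather than bits per second, so the result would be multiplied by the bandwidth of each link to obtain a bandwidth.

```python
from itertools import combinations


def min_bisection_links(num_nodes, links):
    """Return the fewest links crossing any split of the nodes into two equal halves.

    Brute force: only practical for very small networks.
    """
    if num_nodes % 2:
        raise ValueError("num_nodes must be even")
    best = None
    # Keep node 0 in the first half so each bi-section is examined only once.
    for rest in combinations(range(1, num_nodes), num_nodes // 2 - 1):
        half = {0, *rest}
        # A link crosses the bi-section when exactly one endpoint is in `half`.
        crossing = sum(1 for a, b in links if (a in half) != (b in half))
        if best is None or crossing < best:
            best = crossing
    return best


def ring_links(n):
    """Links of a simple ring of n nodes (hypothetical example network)."""
    return [(i, (i + 1) % n) for i in range(n)]


print(min_bisection_links(8, ring_links(8)))                     # 2
print(min_bisection_links(4, list(combinations(range(4), 2))))   # 4 (fully meshed)
```

The exhaustive search grows very quickly with the number of nodes, which is why closed-form expressions are normally used for regular topologies.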
The performance of a multi-processor computer system is limited by the communication latency, which is the time it takes to send a message between any two CPUs. Communication latency has two components: the inherent latency of the communication network 14 when no other message traffic is present, and the additional delays incurred by the presence of other message traffic. The inherent latency of communication network 14 depends upon factors including: the topology of network 14 (i.e. the number of switches and data links that must be traversed between sender and receiver), the latency to transmit a message from the sender, the latency to pass a message from one end of a data link to the other, the latency to pass a message through intervening switches, and the latency to accept a message at the receiver.
There are many applications in which performance is determined by the longest time taken for one CPU in the computer system to deliver a result to another CPU. As a result, high communication latency between a pair of CPUs can slow down an entire application. Unlike other communication networks and the applications that operate on those networks, the performance of high performance computing applications is determined by the maximum latency of the communication network 14, not the average latency.
The performance of applications which solve problems iteratively can be limited by maximum latency. For example, the above stellar motion simulation might require processors to exchange stellar positions and velocities with other processors at the end of each simulated 1000 year interval. It is most convenient to exchange stellar position data after the position of each star has been calculated as of the end of the interval currently being considered. Like a motion picture film, the evolution of the Milky Way may be simulated by determining stellar positions for each of a succession of $5 \times 10^{6}$ discrete times separated by 1000 year intervals.
If the simulation is being run on a computer system having 10,000 CPUs then any single CPU may need to receive stellar position data from the other 9,999 CPUs before it can start the calculations for the next interval. The CPU is forced to wait to receive the last to come of the stellar position data from the other CPUs. The delay in receiving the last data is due to two factors: the time taken by the responsible CPU to calculate the last data and the time taken to communicate that data to the other CPUs. This constraint on performance is difficult to avoid when an application computes iteratively with communication between iterations. Many high performance computing applications exhibit this behavior.
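The effect described above may be sketched, under simplifying assumptions, as follows. The function `iteration_time` and the timing values are hypothetical and are chosen only to show that the slowest result, not the average, sets the time for each iteration.

```python
def iteration_time(compute_times, delivery_times):
    """Time for one iteration when every CPU waits for the last result to arrive.

    compute_times[i]  : time CPU i needs to calculate its data (assumed units)
    delivery_times[i] : time for that data to reach the other CPUs (assumed units)
    """
    return max(c + d for c, d in zip(compute_times, delivery_times))


# Hypothetical values for four CPUs; CPU 2 has a slow path through the network.
compute = [1.0, 1.0, 1.0, 1.0]
delivery = [0.25, 0.25, 0.5, 0.25]

slowest = iteration_time(compute, delivery)
average = sum(c + d for c, d in zip(compute, delivery)) / len(compute)
print(slowest, average)  # 1.5 1.3125
```

In this sketch a single slow delivery sets the pace for all four CPUs, even though the average delivery is considerably faster.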
In general, high performance computers will benefit from having a communication network 14 for internal communication that has a large minimum bi-section bandwidth and small maximum latency.
The topology of communication network 14 has a major impact on communication latency. The topology determines the number of data links and switches that a message has to pass through between a particular sender and a particular receiver. The number of hops to get through the communication network 14 acts as a multiplier on individual data link and switch latencies in calculating the overall latency.
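As a rough sketch of how the hop count acts as a multiplier, the inherent latency may be modelled as the sum of the components listed above. The model below assumes that a message crossing a given number of data links passes through one fewer switches; the function name `inherent_latency` and the nanosecond figures are hypothetical.

```python
def inherent_latency(hops, send, link, switch, receive):
    """Latency of one message over an otherwise idle network (simple additive model).

    hops    : number of data links traversed (at least 1)
    send    : latency to transmit the message from the sender
    link    : latency to pass from one end of a data link to the other
    switch  : latency to pass through one intervening switch
    receive : latency to accept the message at the receiver
    Assumption: a path of `hops` data links contains `hops - 1` switches.
    """
    if hops < 1:
        raise ValueError("a message must cross at least one data link")
    return send + hops * link + (hops - 1) * switch + receive


# Hypothetical component latencies in nanoseconds.
print(inherent_latency(2, send=500, link=50, switch=200, receive=500))  # 1300
print(inherent_latency(6, send=500, link=50, switch=200, receive=500))  # 2300
```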
The topology of communication network 14 also has a secondary impact on bandwidth. The bandwidth of individual data links is a primary influence, but the topology also has an influence by providing multiple paths between senders and receivers. Data can be transmitted more quickly if it can be split between multiple paths.
There has been substantial research on the optimum topologies for high performance computing. Many topologies have been invented. These include:
direct connection (sometimes referred to as fully meshed);
bus;
star;
ring (including chordal ring);
various types of multistage interconnect network (MIN);
hypercube (generally known as k-ary n-cube);
mesh (generally known as k-ary n-cube);
toroid (also generally known as k-ary n-cube); and,
fat tree.
Most of these topologies are not currently used for high performance computing. The direct connection, bus, star, and ring topologies cannot be practically scaled up to handle communications between large numbers of CPUs. Hypercubes and multistage interconnect networks have been used in the past, but are currently out of fashion. A toroid network has advantages over a mesh network and does not cost significantly more to make, so mesh networks are seldom seen. Fat trees and toroids are the two predominant network topologies currently being used in high performance computers.
The fat tree topology is described in C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Transactions on Computers, Vol. C-34, No. 10, pp. 892-901, October 1985. FIG. 3 shows a computer system 30 having nodes 31 interconnected by a network 14 configured as a fat tree. Network 14 is represented as an upside down tree with nodes 31 at the leaves, data links 32 as the branches, and switches 33 where the branches join. A topmost switch 34 is the root of the tree.
Fat trees were designed originally for use in networks implemented inside VLSI chips. In such networks a fat tree can be implemented using variable bandwidth data links. Each switch 33 has a single data link connected to its top side. The bandwidth of the top side data link is at least the sum of the bandwidths of the data links connected to the bottom side of the switch. At each level of the tree, the aggregate bandwidth of all data links is the same as at all other levels. As one goes up the tree, the bandwidth of individual data links increases.
While a fat tree with variable bandwidth data links can be readily implemented in an integrated circuit, it cannot be cost-effectively implemented at the macro scale where switches are complete systems and data links are wire cables. Too many variants of switch design are required. At the scale of normal network equipment, it is more cost-effective for all data links to have the same bandwidth.
This has led to the variant of the fat tree network shown in FIG. 4. Switches at each level are duplicated and additional data links are added. All of the data links can have the same bandwidth. The aggregate bandwidth at each level of the tree still remains the same as at any other level of the tree. The duplication of switches has the side effect that there is the same number of switches at each level.
FIG. 5 shows another example of a fat tree network. Eight port switches are used in a three level fat tree to connect 64 nodes together. All data links have the same bandwidth.
If the switches used to construct a fat tree network have SP ports, a single layer fat tree can be constructed that connects together up to $SP/2$ nodes. A two layer fat tree can be constructed that connects together up to $(SP/2)^{2}$ nodes. In general, a fat tree network with L layers can connect together up to $(SP/2)^{L}$ nodes. Conversely, a fat tree network constructed to connect together N nodes must have at least $\lceil \log_{SP/2}(N) \rceil$ layers.
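The relationship between switch size, number of layers and number of nodes can be sketched as follows. The function names are assumptions introduced for illustration. The sketch assumes an even number of ports per switch and uses integer arithmetic in place of the logarithm so that exact powers are not affected by floating-point rounding.

```python
def fat_tree_max_nodes(switch_ports, layers):
    """Largest number of nodes an L-layer fat tree can connect: (SP/2)**L."""
    return (switch_ports // 2) ** layers


def fat_tree_min_layers(switch_ports, num_nodes):
    """Smallest L such that (SP/2)**L >= N, i.e. ceil(log base SP/2 of N).

    A network with at least one node is treated as needing at least one layer.
    """
    radix = switch_ports // 2
    if radix < 2 or num_nodes < 1:
        raise ValueError("need at least 4 ports per switch and at least 1 node")
    layers, capacity = 1, radix
    while capacity < num_nodes:
        capacity *= radix
        layers += 1
    return layers


print(fat_tree_max_nodes(8, 3))        # 64, as in the three level example of FIG. 5
print(fat_tree_min_layers(8, 64))      # 3
print(fat_tree_min_layers(32, 10000))  # 4
```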
The maximum latency occurs in a fat tree network when a packet must travel all the way to a switch 34 at the top of the fat tree before it can descend to the destination node. This requires the packet to pass through every layer twice except for the top layer. The maximum latency for a fat tree network is thus $2 \times \lceil \log_{SP/2}(N) \rceil$ hops.
The minimum bi-section bandwidth of a fat tree occurs for the bi-section which separates the fat tree into a left half and a right half. If there are N nodes connected by a fat tree network, and each is connected to the network by a single data link, the minimum bi-section bandwidth is (N/2)×linkBW, where linkBW is the bandwidth of each link.
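The two fat tree figures of merit given above can be evaluated with a short sketch such as the following. The function names and the 10 Gbit/s link bandwidth are hypothetical; the layer count is computed with integer arithmetic rather than a floating-point logarithm.

```python
def fat_tree_max_hops(switch_ports, num_nodes):
    """Maximum latency in hops: 2 x ceil(log base SP/2 of N)."""
    radix = switch_ports // 2
    if radix < 2 or num_nodes < 1:
        raise ValueError("need at least 4 ports per switch and at least 1 node")
    layers, capacity = 1, radix
    while capacity < num_nodes:  # smallest number of layers that fits N nodes
        capacity *= radix
        layers += 1
    return 2 * layers


def fat_tree_min_bisection_bandwidth(num_nodes, link_bw):
    """Minimum bi-section bandwidth: (N/2) x linkBW, one data link per node."""
    return (num_nodes // 2) * link_bw


# 64 nodes, 8-port switches, hypothetical 10 Gbit/s data links.
print(fat_tree_max_hops(8, 64))                  # 6
print(fat_tree_min_bisection_bandwidth(64, 10))  # 320 (Gbit/s)
```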
In a mesh network, compute sub-systems are interconnected in a D-dimensional grid. FIG. 6 shows a 6×7 2-dimensional mesh network 60 in which nodes 61 are interconnected by data links 62. Each node 61 is connected to its nearest-neighbor nodes. In high performance computer systems, mesh networks are usually designed with equal length sides in order to minimize the maximum latency.
A toroid network is a mesh network that includes supplementary wrap-around data links connecting corresponding nodes on opposite sides of the topology. FIG. 7 shows a 6×6 2-dimensional toroid network 70. FIG. 8 shows a 2×2×2 3-dimensional toroid network 80. In a toroid network, the nodes 81A and 81B that are directly opposite each other on opposite sides of the network are joined together by a wrap-around data link 82. This is done for all possible directly opposite pairs of nodes.
A toroid constructed from a mesh having D dimensions with sides each having length n can be used to connect together $n^{D}$ nodes. The individual nodes in a mesh or toroid network can be referenced by assigning them a coordinate based on a Cartesian coordinate system. If the mesh or toroid has D dimensions, each node can be assigned a coordinate with D components. For example, a Cartesian coordinate system based on x, y, and z components can uniquely identify each node of a mesh or toroid having 3 dimensions. In this disclosure, numbering starts at 1 for each dimension. If the mesh or toroid has equal length n in each dimension then the numbering of each dimension ranges from 1 to n.
The relative positions of two nodes in a mesh or toroid can be expressed as differences between the coordinates of the two nodes in each dimension. For example, in a 3-dimensional mesh or toroid, one can use the notation Δx1, Δx2, and Δx3 to refer to the differences in position in the three dimensions respectively.
The data links of a mesh or toroid network can be referred to as Cartesian data links, because they follow the grid lines of a Cartesian coordinate system to which the nodes are mapped. A Cartesian data link connects two nodes whose coordinates are different in only one of the D dimensions.
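The coordinate scheme and the notion of a Cartesian data link may be illustrated with the following sketch. The function names are assumptions. Coordinates are tuples numbered from 1 to n in each dimension, and the sketch additionally assumes that linked nodes are adjacent along the one differing dimension, either by a single step or, in a toroid, through the wrap-around between positions 1 and n.

```python
def is_cartesian_link(a, b, n, wrap=False):
    """True if nodes a and b would be joined by a Cartesian data link.

    The coordinates must differ in exactly one dimension, by one step,
    or (for a toroid, wrap=True) by the wrap-around between 1 and n.
    """
    if len(a) != len(b):
        raise ValueError("coordinates must have the same number of dimensions")
    diffs = [abs(x - y) for x, y in zip(a, b) if x != y]
    if len(diffs) != 1:
        return False
    return diffs[0] == 1 or (wrap and diffs[0] == n - 1)


def toroid_neighbors(node, n):
    """Nodes reached from `node` over the 2 x D data links of a toroid."""
    result = []
    for dim in range(len(node)):
        for step in (-1, 1):
            coord = list(node)
            coord[dim] = (coord[dim] - 1 + step) % n + 1  # wrap within 1..n
            result.append(tuple(coord))
    return result


print(is_cartesian_link((1, 1), (1, 2), n=6))             # True
print(is_cartesian_link((1, 1), (2, 2), n=6))             # False (diagonal)
print(is_cartesian_link((1, 1), (6, 1), n=6, wrap=True))  # True (wrap-around)
print(toroid_neighbors((1, 1, 1), n=6))
```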
A toroid is a switchless communication network. Nodes are connected directly to one another. This requires that each node have a minimum of 2×D ports to which data links can be connected.
In a mesh network the maximum latency occurs when a sender and receiver are at opposite ends of the longest diagonal of the mesh. The maximum latency of a mesh network is $D \times (n - 1)$ or, expressed in terms of the number of nodes N, $D \times (N^{1/D} - 1)$ hops.
In a toroid, the maximum latency also occurs on the diagonals but the maximum latency is approximately half that of a similar mesh network, because the wrap-around data links provide a second alternative path for reaching the destination.
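The comparison between mesh and toroid can be checked on small networks with the sketch below. The mesh expression is the one given above. The toroid expression D×⌊n/2⌋ is an assumption introduced here as one way of making "approximately half" concrete; it follows from taking the shorter of the direct and wrap-around routes in each dimension. All function names are hypothetical.

```python
from itertools import product


def mesh_max_hops(n, dims):
    """Maximum latency of a mesh with side n in each of D dimensions: D x (n - 1)."""
    return dims * (n - 1)


def toroid_max_hops(n, dims):
    """Assumed maximum latency of a toroid: D x floor(n / 2)."""
    return dims * (n // 2)


def hops_between(a, b, n, wrap):
    """Hops along Cartesian data links, using wrap-around links when wrap=True."""
    total = 0
    for x, y in zip(a, b):
        d = abs(x - y)
        total += min(d, n - d) if wrap else d
    return total


def max_hops_brute_force(n, dims, wrap):
    """Largest hop count over every pair of nodes (small networks only)."""
    nodes = list(product(range(1, n + 1), repeat=dims))
    return max(hops_between(a, b, n, wrap) for a in nodes for b in nodes)


print(mesh_max_hops(6, 2), max_hops_brute_force(6, 2, wrap=False))   # 10 10
print(toroid_max_hops(6, 2), max_hops_brute_force(6, 2, wrap=True))  # 6 6
```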
The minimum bi-section bandwidth for a toroidal network is experienced when a toroid is bisected into a left half and a right half. If only one data link connects adjacent nodes, the minimum bi-section bandwidth is $2 \times n^{D-1} \times \text{linkBW}$. The factor of 2 is due to the wrap-around data links.
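The link count in this expression can be confirmed for a small toroid by building the data links explicitly and counting those that cross a cut through the middle of the first dimension. This is a sketch only: the function names are assumptions, the side length n is assumed to be even and at least 4, and the sketch counts links for one particular left/right cut rather than proving that no other bi-section has fewer.

```python
from itertools import product


def toroid_links(n, dims):
    """One data link from each node to its successor in each dimension (n wraps to 1)."""
    links = []
    for node in product(range(1, n + 1), repeat=dims):
        for dim in range(dims):
            nxt = list(node)
            nxt[dim] = node[dim] % n + 1
            links.append((node, tuple(nxt)))
    return links


def links_crossing_middle(n, dims):
    """Count links joining the half with first coordinate <= n/2 to the other half."""
    half = n // 2
    return sum(
        1 for a, b in toroid_links(n, dims) if (a[0] <= half) != (b[0] <= half)
    )


print(links_crossing_middle(6, 2), 2 * 6 ** (2 - 1))  # 12 12
print(links_crossing_middle(4, 3), 2 * 4 ** (3 - 1))  # 32 32
```

Half of the crossing links are the ordinary links through the middle of the toroid and the other half are the wrap-around links, which is the origin of the factor of 2.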
FIG. 9 plots maximum latency for both fat trees and toroidal networks as a function of the number of nodes in a computing system. Latency is measured as the number of hops. In both networks the traversal of a data link is considered to be a hop. The maximum latency of a fat tree network depends on both the size of the computing system and the size of the switches being used in terms of switch ports. Curve 93 is for a fat tree constructed using switches each having 8 ports. Curve 94 is for a fat tree constructed using switches having 32 ports.
The maximum latency of a toroid network depends on both the size of the computing system and the number of dimensions of the toroid. Curves 90, 91 and 92 are respectively for toroids having 2, 3 and 4 dimensions. It can be seen that toroids generally have a higher maximum latency than fat tree networks. Maximum latency can only be reduced in a toroid network by using more dimensions. If enough dimensions are used, the maximum latency of a toroid network can be reduced to levels similar to a fat tree. Unfortunately the use of additional dimensions does not really help in practice, because many applications map most naturally to a toroid with 3 dimensions.
FIG. 10 plots minimum bi-section bandwidth as a function of the number of nodes in a computing system. Bandwidth is measured as the number of data links crossing the bi-section. The number of data links can be converted to an actual bandwidth in bits per second by multiplying the number of data links by the bandwidth in bits per second of each data link.
The minimum bi-section bandwidth of a fat tree only depends on the size of the computing system. Curve 100 shows the minimum bi-section bandwidth for a fat tree. The minimum bi-section bandwidth of a toroid network depends on both the size of the computing system and the number of dimensions of the toroid. Curves 101, 102 and 103 are respectively for toroids having 4, 3 and 2 dimensions.
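The kind of comparison plotted in FIGS. 9 and 10 can be approximated numerically for a single system size. The sketch below is not a reproduction of the figures: it evaluates the expressions given earlier for a hypothetical system of 4096 nodes, a size chosen because it is an exact power for every configuration considered. The toroid latency uses D×⌊n/2⌋, which is an assumption standing in for "approximately half" of the mesh latency, and the function names are likewise assumptions.

```python
def fat_tree_metrics(switch_ports, num_nodes):
    """Return (maximum hops, bi-section links) for a fat tree."""
    radix = switch_ports // 2
    layers, capacity = 1, radix
    while capacity < num_nodes:  # integer form of ceil(log base SP/2 of N)
        capacity *= radix
        layers += 1
    return 2 * layers, num_nodes // 2


def toroid_metrics(n, dims):
    """Return (assumed maximum hops, bi-section links) for a toroid of side n."""
    return dims * (n // 2), 2 * n ** (dims - 1)


NUM_NODES = 4096  # hypothetical system size: 64**2 == 16**3 == 8**4

print("fat tree, 8-port switches :", fat_tree_metrics(8, NUM_NODES))   # (12, 2048)
print("fat tree, 32-port switches:", fat_tree_metrics(32, NUM_NODES))  # (6, 2048)
print("toroid, 2 dimensions      :", toroid_metrics(64, 2))            # (64, 128)
print("toroid, 3 dimensions      :", toroid_metrics(16, 3))            # (24, 512)
print("toroid, 4 dimensions      :", toroid_metrics(8, 4))             # (16, 1024)
```

For this system size every fat tree configuration has fewer maximum hops and more links across the bi-section than any of the toroids, and the toroid figures improve as dimensions are added.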
It can be concluded from a comparison of fat tree networks with toroid networks using FIGS. 9 and 10 that a fat tree generally provides superior performance relative to a toroid in terms of lower maximum latency and greater minimum bi-section bandwidth. Despite this, toroidal network topologies are used frequently because many applications model physical phenomena in a 3-dimensional continuum. In such applications, a majority of data communications is between nearest-neighbor nodes. In a toroid, nearest-neighbor nodes are always directly connected. Such applications, which may be called nearest-neighbor applications, map well to toroidal topologies.
In contrast, nearest-neighbor applications do not map conveniently to fat tree networks. Nearest-neighbor applications typically treat a number of points arranged in a D-dimensional space. The points interact primarily with their nearest-neighbors. Such applications typically map each of the points to a node associated with a processor or group of processors so that calculations affecting the different points may be carried on in parallel. It is impossible to map a large number of points onto nodes without crossing a boundary in the fat tree network that forces network traffic to rise upwards to higher layers of the fat tree. For example, in FIG. 5 it is impossible to allocate a nearest-neighbor application to more than 32 nodes without some of the nodes being on opposite sides of the central divide 50. Communication passing across the central divide of a fat tree network must transit through the uppermost switch layer to reach the other side of the fat tree network. In FIG. 5, this means that some network traffic incurs a latency of 6 hops. As has been mentioned previously, high performance applications tend to run at the speed of the slowest components. With all other factors being equal, the extra latency required to pass network traffic across central divide 50 will reduce the performance of the application.
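The penalty for crossing a boundary in a fat tree can be sketched by counting hops between two leaf nodes. The sketch below assumes, purely for illustration, that nodes are numbered consecutively from 0 under each switch and that a message rises only as far as the lowest switch layer shared by sender and receiver; the function name `fat_tree_hops` is hypothetical.

```python
def fat_tree_hops(src, dst, switch_ports):
    """Hops between two leaf nodes of a fat tree built from SP-port switches.

    Assumes nodes numbered from 0, with SP/2 consecutive nodes under each
    bottom-layer switch, SP/2 consecutive switches under each switch above, etc.
    """
    if src == dst:
        return 0
    radix = switch_ports // 2
    level = 0
    # Climb one layer at a time until both nodes fall under the same switch group.
    while src != dst:
        src //= radix
        dst //= radix
        level += 1
    return 2 * level  # `level` links up plus `level` links down


# 64 nodes with 8-port switches, as in the example of FIG. 5.
print(fat_tree_hops(0, 3, 8))    # 2: same bottom-layer switch
print(fat_tree_hops(0, 4, 8))    # 4: neighboring bottom-layer switches
print(fat_tree_hops(31, 32, 8))  # 6: opposite sides of the central divide
```

Under this assumed numbering, nodes 31 and 32 are adjacent in number yet lie on opposite sides of the central divide, so traffic between them incurs the full 6 hops.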
It can be concluded that for arbitrary patterns of communication fat tree networks offer superior performance to toroid networks due to their lower maximum latency and greater minimum bi-section bandwidth. In the more restricted domain of nearest-neighbor applications, toroids provide superior performance.
Most high performance computing sites run a mix of high performance applications. A significant proportion of these applications are nearest-neighbor applications that benefit from a toroid network. A significant proportion of the remainder require patterns of communication that benefit from a fat tree network.
There remains a need for computer systems which can run effectively a wide range of applications. There is a particular need for such computer systems which combine desirable characteristics of both fat tree and toroid network topologies.